The Programme for International Student Assessment (PISA) is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test that examines how well students have learnt the school curriculum; rather, it is an international assessment that tests how well students are prepared for life beyond school.
The dataset covers 5233 students in 2 economies, both non-OECD members, who took part in the PISA assessment of reading, mathematics and science literacy. It also consists of 636 variables (features) describing each student's background, personality and academic performance. Some of the variables were repeated and some were condensed, so effort was made to select the features necessary for the analysis. The selected features include the student's country, age, grade and gender, whether they repeated a grade, perseverance level, immigration status, classroom management, teacher support, student-teacher relation, parents' highest level of education and parents' highest years of education, to mention a few. The dataset was gathered from a Udacity-hosted site available here. In addition, the PISA data dictionary can be obtained from here.
Furthermore, the variables enumerated above were explored to understand the relationships among them. This provides further insight into the students' personalities and highlights the factors that could affect their academic performance. To achieve this, several questions were asked, some of which include:
- What factors are responsible for students' academic performance in school?
- Do students who study for long hours perform better academically than those who study for short hours?
- What is the impact of class repetition on students' academic performance?
- Does a student's immigration status affect his/her academic performance?
- What is the relationship between giving up easily and academic performance?
- Could immigration status and a tendency to give up easily hinder students from performing well academically?
- Does classroom management have an impact on students' academic performance?
- What can students' failure in their academic studies be attributed to?
- How does teacher support influence the academic performance of students?
- Have students learnt the school curriculum well enough?
- What is the impact of the student-teacher relationship on students' academic achievement?
- Does good academic achievement imply that students are prepared for life within and outside school?
These questions and many more will be addressed in this report. Moreover, data visualizations will be included to shed further light on the analysis.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
The next step is to load in the "PISA" dataset and describe its properties.
# load in the dataset into a pandas dataframe
pisa = pd.read_csv('pisa.csv', sep= ",")
I will use both visual and programmatic assessment to assess the data.
First, I need to gather more information about the dataset, such as the number of rows and columns (its shape), the data types and the summary statistics of each column.
# high-level overview of data shape and composition
print(pisa.shape)
print('\n')
print(pisa.dtypes)
pisa.head(10)
(5233, 636)
Unnamed: 0 int64
CNT object
SUBNATIO int64
STRATUM object
OECD object
...
W_FSTR80 float64
WVARSTRR float64
VAR_UNIT float64
SENWGT_STU float64
VER_STU object
Length: 636, dtype: object
| | Unnamed: 0 | CNT | SUBNATIO | STRATUM | OECD | NC | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ... | W_FSTR75 | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU | VER_STU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 1 | 10 | 1 | ... | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 | 19.0 | 1.0 | 0.2098 | 22NOV13 |
| 1 | 2 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 2 | 10 | 1 | ... | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 | 19.0 | 1.0 | 0.2098 | 22NOV13 |
| 2 | 3 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 3 | 9 | 1 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19.0 | 1.0 | 0.1999 | 22NOV13 |
| 3 | 4 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 4 | 9 | 1 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19.0 | 1.0 | 0.1999 | 22NOV13 |
| 4 | 5 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 5 | 9 | 1 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19.0 | 1.0 | 0.1999 | 22NOV13 |
| 5 | 6 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 6 | 9 | 1 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19.0 | 1.0 | 0.1999 | 22NOV13 |
| 6 | 7 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 7 | 10 | 1 | ... | 13.7954 | 13.9235 | 13.1249 | 13.1249 | 4.3389 | 13.0829 | 19.0 | 1.0 | 0.2098 | 22NOV13 |
| 7 | 8 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 8 | 10 | 1 | ... | 14.4599 | 14.6374 | 15.8728 | 15.8728 | 5.2248 | 15.2579 | 19.0 | 1.0 | 0.2322 | 22NOV13 |
| 8 | 9 | Albania | 80000 | ALB0006 | Non-OECD | Albania | 1 | 9 | 9 | 1 | ... | 12.7307 | 12.7307 | 12.7307 | 12.7307 | 4.2436 | 12.7307 | 19.0 | 1.0 | 0.1999 | 22NOV13 |
| 9 | 10 | Albania | 80000 | ALB0005 | Non-OECD | Albania | 2 | 10 | 10 | 1 | ... | 3.3844 | 10.1533 | 3.3844 | 10.1533 | 10.1533 | 10.1533 | 74.0 | 2.0 | 0.1594 | 22NOV13 |
10 rows × 636 columns
# Get more information on the dataset
pisa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Columns: 636 entries, Unnamed: 0 to VER_STU
dtypes: float64(353), int64(14), object(269)
memory usage: 25.4+ MB
Because the PISA dataset has so many columns, info() prints only a summary (column range, dtype counts and memory usage) instead of listing every column.
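By default pandas falls back to this short summary once a frame has more columns than the `display.max_info_columns` option (100 unless it has been changed). The sketch below uses a small made-up frame, not the PISA file, to show how `verbose=True` could be used to force the per-column listing when it is needed.

```python
import io
import pandas as pd

# Toy frame with more columns than pandas lists by default (hypothetical data)
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(150)})

# verbose=True forces the per-column listing; buf captures the text instead of printing it
buffer = io.StringIO()
wide.info(verbose=True, buf=buffer)
full_listing = buffer.getvalue()
print(full_listing[:300])
```

For the real data the equivalent call would be `pisa.info(verbose=True)`, which prints 636 lines, so the summary form is usually the more readable choice.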
# Check an overview on the statistics of the dataset
pisa.describe()
| | Unnamed: 0 | SUBNATIO | SCHOOLID | STIDSTD | ST01Q01 | ST02Q01 | ST03Q01 | ST03Q02 | ST06Q01 | ST115Q01 | ... | W_FSTR74 | W_FSTR75 | W_FSTR76 | W_FSTR77 | W_FSTR78 | W_FSTR79 | W_FSTR80 | WVARSTRR | VAR_UNIT | SENWGT_STU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5233.00000 | 5.233000e+03 | 5233.000000 | 5233.000000 | 5233.000000 | 5233.000000 | 5233.000000 | 5233.0 | 4996.000000 | 4884.000000 | ... | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 |
| mean | 2617.00000 | 8.066337e+05 | 95.288171 | 2172.881903 | 9.623925 | 1.124403 | 6.315116 | 1996.0 | 6.361689 | 1.242424 | ... | 8.429316 | 8.323052 | 8.328039 | 8.367237 | 8.165812 | 8.254512 | 8.274102 | 40.869839 | 1.514717 | 0.197386 |
| std | 1510.78131 | 2.260922e+06 | 62.783150 | 1444.015784 | 0.579974 | 0.463090 | 3.387933 | 0.0 | 0.782878 | 0.531800 | ... | 6.887186 | 7.019799 | 7.176734 | 6.196027 | 6.149730 | 6.824609 | 6.311611 | 23.434703 | 0.519707 | 0.111031 |
| min | 1.00000 | 8.000000e+04 | 1.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 1996.0 | 4.000000 | 1.000000 | ... | 0.309100 | 0.309600 | 0.309600 | 0.309100 | 0.309100 | 0.309100 | 0.318000 | 1.000000 | 1.000000 | 0.023500 |
| 25% | 1309.00000 | 8.000000e+04 | 35.000000 | 819.000000 | 9.000000 | 1.000000 | 3.000000 | 1996.0 | 6.000000 | 1.000000 | ... | 3.857700 | 3.837000 | 3.837600 | 3.823600 | 3.820800 | 3.854500 | 3.871500 | 20.000000 | 1.000000 | 0.146700 |
| 50% | 2617.00000 | 8.000000e+04 | 95.000000 | 2127.000000 | 10.000000 | 1.000000 | 6.000000 | 1996.0 | 6.000000 | 1.000000 | ... | 6.056800 | 5.577700 | 5.419900 | 6.647000 | 5.577700 | 5.388100 | 5.702850 | 41.000000 | 2.000000 | 0.187800 |
| 75% | 3925.00000 | 8.000000e+04 | 149.000000 | 3435.000000 | 10.000000 | 1.000000 | 9.000000 | 1996.0 | 7.000000 | 1.000000 | ... | 11.963600 | 11.957500 | 11.963600 | 12.244325 | 11.957500 | 11.916100 | 11.999300 | 62.000000 | 2.000000 | 0.221200 |
| max | 5233.00000 | 7.840200e+06 | 204.000000 | 4743.000000 | 12.000000 | 4.000000 | 12.000000 | 1996.0 | 16.000000 | 4.000000 | ... | 75.555200 | 75.555200 | 75.555200 | 63.491300 | 63.491300 | 75.555200 | 61.031500 | 80.000000 | 3.000000 | 1.140200 |
8 rows × 367 columns
# Check for missing data in the pisa dataset
pisa.isnull().sum().any()
True
# Count the columns that contain at least one missing value
print(sum(pisa.isnull().values.any(axis=0)))
# Count the rows that contain at least one missing value
sum(pisa.isnull().values.any(axis=1))
617
5233
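The two numbers above mean that 617 of the 636 columns, and every one of the 5233 rows, contain at least one missing value. A toy sketch with invented values shows how the `axis` argument separates the two counts:

```python
import numpy as np
import pandas as pd

# Invented three-row frame, only to illustrate the axis semantics
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, np.nan, 3.0],
})

# axis=0 collapses the rows: one True/False per column
cols_with_missing = int(toy.isna().any(axis=0).sum())
# axis=1 collapses the columns: one True/False per row
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(cols_with_missing, rows_with_missing)
```

Because every row has a gap somewhere, dropping incomplete rows would empty the dataset, so missing values have to be handled column by column.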
# Check for the sum of duplicated values
sum(pisa.duplicated())
0
# Check the unique values of REPEAT
pisa.REPEAT.unique()
array(['Did not repeat a <grade>', nan, 'Repeated a <grade>'],
dtype=object)
# Check the unique values of AGE
pisa['AGE'].unique()
array([16.17, 15.58, 15.67, 15.5 , 16.08, 15.83, 15.92, 16. , 15.75,
16.25, 15.33, 15.42, 16.33, nan])
# Check the unique values of PARED
pisa['PARED'].unique()
array([12., 16., 10., 3., 6., 15., nan, 9., 5.])
# Check the unique values of OECD
pisa.OECD.unique()
# There is only one unique value, 'Non-OECD', so all students are from non-OECD countries.
array(['Non-OECD'], dtype=object)
# Check the unique values of TIMEINT (Time of computer use (mins))
pisa.TIMEINT.unique()
array([nan])
# Check the unique values of ICTSCH (ICT Availability at School)
pisa.ICTSCH.unique()
array([nan])
# Check the unique values of ST44Q03 (Attributions to Failure - Teacher Did Not Explain Well)
pisa.ST44Q03.unique()
array(['Slightly likely', 'Likely', nan, 'Not at all likely',
'Very Likely'], dtype=object)
# Check the unique values of ST44Q07 (Attributions to Failure - Teacher Did not Get Students Interested)
pisa.ST44Q07.unique()
array(['Likely', 'Slightly likely', 'Very Likely', nan,
'Not at all likely'], dtype=object)
# Check the unique values of ST85Q01 (Classroom Management - Students Listen)
pisa.ST85Q01.unique()
array(['Agree', nan, 'Strongly agree', 'Strongly disagree', 'Disagree'],
dtype=object)
# Check the value counts of ST83Q04 (Teacher Support - Let Us Know We Have to Work Hard)
pisa.ST83Q04.value_counts()
Strongly agree       1620
Agree                1155
Disagree              145
Strongly disagree      69
Name: ST83Q04, dtype: int64
# Check the unique values of STUDREL (Student-Teacher relation)
pisa.STUDREL.unique()
array([-1.04, nan, -0.02, 0.81, 1.13, 1.51, 2.16, 0.45, -0.48,
0.52, 0.53, -1.26, -1.47, -0.79, -0.64, 0.59, -3.11, -0.55,
0.71, -1.91, -0.05, 0. , 0.95, 0.57, 1.35, -0.56, 1.38,
-0.54, 2.04, -0.03, -0.08, -1.23, -1.68, -0.15, 2.09, -0.9 ,
-1.46, 0.01, -0.66, 0.98, -0.97, -2.16, 1.72, 2.06, -0.6 ,
1.42, 0.93, 0.99, 1. , 2.02, -1.48, -2.5 ])
# Check the unique values of HISCED
pisa.HISCED.unique()
array(['ISCED 3A, ISCED 4', 'ISCED 5A, 6', 'ISCED 3B, C', 'ISCED 2',
'ISCED 5B', 'None', 'ISCED 1', nan], dtype=object)
Condense the plausible values in mathematics ['PV1MATH','PV2MATH','PV3MATH', 'PV4MATH', 'PV5MATH'], in reading ['PV1READ','PV2READ','PV3READ', 'PV4READ', 'PV5READ'] and in science ['PV1SCIE','PV2SCIE','PV3SCIE','PV4SCIE','PV5SCIE'] by finding the mean. Thereafter rename the mean of each subject as 'Average_math_literacy', 'Average_reading_literacy' and 'Average_science_literacy' respectively.
Condense the learning time (minutes per week) columns for science, test language and mathematics ('SMINS', 'LMINS' and 'MMINS' respectively) by finding their mean and storing it in a column called 'Average_learning_time'.
In this section, all issues documented will be addressed and cleaned.
# Before cleaning, create a copy of the dataframe so as to get back to the original dataset in case the need arises.
df_pisa =pisa.copy()
Find the average of the 5 plausible values for each subject below and save each average as 'Average_math_literacy', 'Average_reading_literacy' and 'Average_science_literacy' respectively.
# For Mathematics
df_pisa[['PV1MATH','PV2MATH','PV3MATH', 'PV4MATH', 'PV5MATH']].describe()
| | PV1MATH | PV2MATH | PV3MATH | PV4MATH | PV5MATH |
|---|---|---|---|---|---|
| count | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 |
| mean | 400.173338 | 399.932003 | 400.436557 | 399.857713 | 399.097832 |
| std | 93.527959 | 93.219210 | 94.077933 | 94.012758 | 94.018127 |
| min | 62.400700 | 60.998600 | 53.910300 | 66.373300 | 37.085200 |
| 25% | 340.559300 | 338.222500 | 339.858200 | 340.637200 | 340.773525 |
| 50% | 402.173200 | 402.173200 | 400.887950 | 401.628000 | 398.901700 |
| 75% | 461.528300 | 460.866150 | 461.314075 | 462.015075 | 459.931450 |
| max | 692.794800 | 719.278700 | 751.215100 | 717.798700 | 690.302200 |
df_pisa['Average_math_literacy'] = df_pisa[['PV1MATH','PV2MATH','PV3MATH', 'PV4MATH', 'PV5MATH']].mean(axis=1)
df_pisa['Average_math_literacy']
0 366.18634
1 470.56396
2 505.53824
3 449.45476
4 385.50398
...
5228 574.70790
5229 608.98114
5230 435.43388
5231 646.99332
5232 NaN
Name: Average_math_literacy, Length: 5233, dtype: float64
# For Reading
df_pisa[['PV1READ','PV2READ','PV3READ', 'PV4READ', 'PV5READ']].describe()
| | PV1READ | PV2READ | PV3READ | PV4READ | PV5READ |
|---|---|---|---|---|---|
| count | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 |
| mean | 401.928187 | 402.319508 | 400.444503 | 402.404353 | 400.930271 |
| std | 115.084330 | 115.105851 | 115.013770 | 115.515613 | 115.768850 |
| min | 0.083400 | 3.109300 | 2.387600 | 4.849200 | 2.307400 |
| 25% | 330.363050 | 330.774525 | 330.516400 | 330.595800 | 332.728775 |
| 50% | 410.265100 | 408.438200 | 407.369800 | 409.907700 | 406.589600 |
| 75% | 479.926100 | 480.958700 | 478.337500 | 481.148800 | 481.180700 |
| max | 742.048500 | 784.146900 | 734.899700 | 770.891600 | 796.233000 |
df_pisa['Average_reading_literacy'] = df_pisa[['PV1READ','PV2READ','PV3READ', 'PV4READ', 'PV5READ']].mean(axis=1)
df_pisa['Average_reading_literacy']
0 261.01424
1 384.68832
2 405.18154
3 477.46376
4 256.01010
...
5228 509.37726
5229 523.65190
5230 423.56914
5231 553.16346
5232 NaN
Name: Average_reading_literacy, Length: 5233, dtype: float64
# For Science
df_pisa[['PV1SCIE','PV2SCIE','PV3SCIE','PV4SCIE','PV5SCIE']].describe()
| | PV1SCIE | PV2SCIE | PV3SCIE | PV4SCIE | PV5SCIE |
|---|---|---|---|---|---|
| count | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 | 5232.000000 |
| mean | 405.494542 | 405.818382 | 405.153235 | 405.548367 | 405.180683 |
| std | 100.250409 | 99.622204 | 100.855630 | 100.411146 | 99.224778 |
| min | 39.668000 | 22.417000 | 40.134300 | 34.912300 | 40.134300 |
| 25% | 343.659100 | 346.643000 | 344.778100 | 345.407500 | 345.710600 |
| 50% | 408.420350 | 409.865700 | 408.187200 | 408.840000 | 407.814250 |
| 75% | 471.876200 | 470.570700 | 472.901900 | 470.011200 | 470.663900 |
| max | 726.725100 | 747.612800 | 744.815400 | 770.925000 | 764.397600 |
df_pisa['Average_science_literacy'] = df_pisa[['PV1SCIE','PV2SCIE','PV3SCIE','PV4SCIE','PV5SCIE']].mean(axis=1)
df_pisa['Average_science_literacy']
0 371.91348
1 478.12382
2 486.60946
3 453.97240
4 367.15778
...
5228 493.78966
5229 513.18540
5230 414.52818
5231 609.41812
5232 NaN
Name: Average_science_literacy, Length: 5233, dtype: float64
df_pisa['Academic_performance'] = df_pisa[['Average_math_literacy','Average_reading_literacy','Average_science_literacy']].mean(axis=1)
df_pisa[['Average_math_literacy','Average_reading_literacy','Average_science_literacy','Academic_performance']].head()
| | Average_math_literacy | Average_reading_literacy | Average_science_literacy | Academic_performance |
|---|---|---|---|---|
| 0 | 366.18634 | 261.01424 | 371.91348 | 333.038020 |
| 1 | 470.56396 | 384.68832 | 478.12382 | 444.458700 |
| 2 | 505.53824 | 405.18154 | 486.60946 | 465.776413 |
| 3 | 449.45476 | 477.46376 | 453.97240 | 460.296973 |
| 4 | 385.50398 | 256.01010 | 367.15778 | 336.223953 |
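The three subject blocks above repeat the same pattern, so the same result could be produced in one loop. The sketch below is an alternative arrangement, not the code used in this report, and runs on a made-up two-student frame that only borrows the PISA column names.

```python
import numpy as np
import pandas as pd

# Hypothetical two-student frame with random scores under the PISA column names
rng = np.random.default_rng(0)
subjects = {
    "MATH": "Average_math_literacy",
    "READ": "Average_reading_literacy",
    "SCIE": "Average_science_literacy",
}
pv_cols = [f"PV{i}{s}" for s in subjects for i in range(1, 6)]
toy = pd.DataFrame(rng.uniform(300, 600, size=(2, 15)), columns=pv_cols)

# One row-wise mean per subject over its five plausible values
for suffix, new_name in subjects.items():
    cols = [f"PV{i}{suffix}" for i in range(1, 6)]
    toy[new_name] = toy[cols].mean(axis=1)

# Overall score: mean of the three subject averages
toy["Academic_performance"] = toy[list(subjects.values())].mean(axis=1)
print(toy[list(subjects.values()) + ["Academic_performance"]])
```

As a check against the table above, row 0 gives (366.18634 + 261.01424 + 371.91348) / 3 = 333.03802, which matches the reported Academic_performance.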
df_pisa[['LMINS','MMINS','SMINS']].describe()
| | LMINS | MMINS | SMINS |
|---|---|---|---|
| count | 2876.000000 | 2885.000000 | 2815.000000 |
| mean | 184.541377 | 181.512305 | 162.776554 |
| std | 68.125593 | 68.513792 | 108.082043 |
| min | 40.000000 | 40.000000 | 0.000000 |
| 25% | 135.000000 | 135.000000 | 90.000000 |
| 50% | 180.000000 | 180.000000 | 90.000000 |
| 75% | 225.000000 | 225.000000 | 270.000000 |
| max | 900.000000 | 1320.000000 | 1200.000000 |
df_pisa['Average_learning_time'] = df_pisa[['LMINS','MMINS','SMINS']].mean(axis=1)
df_pisa['Average_learning_time']
0 NaN
1 225.0
2 300.0
3 120.0
4 NaN
...
5228 NaN
5229 495.0
5230 NaN
5231 787.5
5232 NaN
Name: Average_learning_time, Length: 5233, dtype: float64
df_pisa[['LMINS','MMINS','SMINS','Average_learning_time']].head()
| | LMINS | MMINS | SMINS | Average_learning_time |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | 315.0 | 270.0 | 90.0 | 225.0 |
| 2 | 300.0 | NaN | NaN | 300.0 |
| 3 | 135.0 | 135.0 | 90.0 | 120.0 |
| 4 | NaN | NaN | NaN | NaN |
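Row 2 in the table above gets an average of 300.0 from a single subject because `mean(axis=1)` skips missing values by default. The sketch below reuses rows 1 to 3 of that table to contrast the default with a strict version; the `mean_strict` and `n_subjects` columns are illustrative additions, not part of the report's dataset.

```python
import numpy as np
import pandas as pd

# Rows 1 to 3 copied from the table above
mins = pd.DataFrame({
    "LMINS": [315.0, 300.0, 135.0],
    "MMINS": [270.0, np.nan, 135.0],
    "SMINS": [90.0, np.nan, 90.0],
})
cols = ["LMINS", "MMINS", "SMINS"]

mins["mean_skipna"] = mins[cols].mean(axis=1)                # default: ignore NaN
mins["mean_strict"] = mins[cols].mean(axis=1, skipna=False)  # NaN if any subject is missing
mins["n_subjects"] = mins[cols].notna().sum(axis=1)          # how many values went into the mean
print(mins)
```

Keeping a count such as `n_subjects` makes it possible to tell later which averages rest on one subject only.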
# Select the variables relevant to the analysis; .copy() makes the result an independent dataframe
df_pisa = df_pisa[['CNT','ST04Q01','AGE','GRADE','Average_math_literacy','Average_reading_literacy','Average_science_literacy',
                   'Academic_performance','Average_learning_time','REPEAT', 'IMMIG','ST93Q01', 'HISCED','HISEI','PARED',
                   'ST29Q06','ST88Q01','TEACHSUP', 'STUDREL','ST85Q02','ST85Q03','ST83Q02','ST83Q03','ST86Q02',
                   'ST86Q03','ST86Q04','ST44Q03','ST44Q07']].copy()
The selected columns are then renamed to more descriptive names:
df_pisa.rename(columns = {'CNT' :'Country',
'ST04Q01' : 'Gender',
'AGE' : 'Age',
'GRADE' : 'Grade',
'IMMIG' : 'Immigration_status',
'ST93Q01' : 'Perseverance_Give_up_easily',
'HISCED' : 'Highest_educational_level_parents',
'HISEI' : 'Highest_parental_occupational_status',
'TEACHSUP': 'Teacher_support',
'ST83Q02' : 'Teacher_support_help_when_needed',
'ST83Q03' : 'Teacher_support_help_learn',
'STUDREL' : 'Student_teacher_relation',
'REPEAT' : 'Class_repetition',
'PARED' : 'Highest_parental_education_years',
'ST44Q03' : 'Teacher_did_not_explain_well',
'ST44Q07' : 'Teacher_did_not_get_students_interested',
'ST29Q06' : 'Math_interest',
'ST88Q01' : 'School_does_little_to_prepare_me_for_life',
'ST85Q02' : 'Class_management_teacher_keep_class_orderly',
'ST85Q03' : 'Class_management_teacher_starts_on_time',
'ST86Q02' : 'Student_teacher_relation_teachers_are_interested',
'ST86Q03' : 'Student_teacher_relation_teachers_listen_to_students',
'ST86Q04' : 'Student_teacher_relation_teachers_help_students'}, inplace = True);
df_pisa.head()
| | Country | Gender | Age | Grade | Average_math_literacy | Average_reading_literacy | Average_science_literacy | Academic_performance | Average_learning_time | Class_repetition | ... | Student_teacher_relation | Class_management_teacher_keep_class_orderly | Class_management_teacher_starts_on_time | Teacher_support_help_when_needed | Teacher_support_help_learn | Student_teacher_relation_teachers_are_interested | Student_teacher_relation_teachers_listen_to_students | Student_teacher_relation_teachers_help_students | Teacher_did_not_explain_well | Teacher_did_not_get_students_interested |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | Female | 16.17 | 0.0 | 366.18634 | 261.01424 | 371.91348 | 333.038020 | NaN | Did not repeat a <grade> | ... | -1.04 | Strongly disagree | Disagree | Agree | Agree | Strongly disagree | Agree | Agree | Slightly likely | Likely |
| 1 | Albania | Female | 16.17 | 0.0 | 470.56396 | 384.68832 | 478.12382 | 444.458700 | 225.0 | Did not repeat a <grade> | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Slightly likely | Slightly likely |
| 2 | Albania | Female | 15.58 | -1.0 | 505.53824 | 405.18154 | 486.60946 | 465.776413 | 300.0 | Did not repeat a <grade> | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Likely | Very Likely |
| 3 | Albania | Female | 15.67 | -1.0 | 449.45476 | 477.46376 | 453.97240 | 460.296973 | 120.0 | Did not repeat a <grade> | ... | NaN | NaN | NaN | Strongly agree | Strongly agree | NaN | NaN | NaN | NaN | NaN |
| 4 | Albania | Female | 15.50 | -1.0 | 385.50398 | 256.01010 | 367.15778 | 336.223953 | NaN | Did not repeat a <grade> | ... | -0.02 | Agree | Strongly agree | Agree | Strongly agree | Agree | Agree | Agree | Likely | Slightly likely |
5 rows × 28 columns
df_pisa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Data columns (total 28 columns):
 #   Column                                                Non-Null Count  Dtype
---  ------                                                --------------  -----
 0   Country                                               5233 non-null   object
 1   Gender                                                5233 non-null   object
 2   Age                                                   5232 non-null   float64
 3   Grade                                                 5232 non-null   float64
 4   Average_math_literacy                                 5232 non-null   float64
 5   Average_reading_literacy                              5232 non-null   float64
 6   Average_science_literacy                              5232 non-null   float64
 7   Academic_performance                                  5232 non-null   float64
 8   Average_learning_time                                 2937 non-null   float64
 9   Class_repetition                                      4833 non-null   object
 10  Immigration_status                                    4812 non-null   object
 11  Perseverance_Give_up_easily                           2918 non-null   object
 12  Highest_educational_level_parents                     5225 non-null   object
 13  Highest_parental_occupational_status                  424 non-null    float64
 14  Highest_parental_education_years                      5225 non-null   float64
 15  Math_interest                                         2951 non-null   object
 16  School_does_little_to_prepare_me_for_life             2966 non-null   object
 17  Teacher_support                                       3021 non-null   float64
 18  Student_teacher_relation                              3001 non-null   float64
 19  Class_management_teacher_keep_class_orderly           2986 non-null   object
 20  Class_management_teacher_starts_on_time               2988 non-null   object
 21  Teacher_support_help_when_needed                      2988 non-null   object
 22  Teacher_support_help_learn                            2984 non-null   object
 23  Student_teacher_relation_teachers_are_interested      2973 non-null   object
 24  Student_teacher_relation_teachers_listen_to_students  2978 non-null   object
 25  Student_teacher_relation_teachers_help_students       2972 non-null   object
 26  Teacher_did_not_explain_well                          2914 non-null   object
 27  Teacher_did_not_get_students_interested               2906 non-null   object
dtypes: float64(11), object(17)
memory usage: 1.1+ MB
# Drop every column in which more than 90% of the values are missing
for i in df_pisa.columns:
    if (df_pisa[i].isna().sum() / len(df_pisa)) * 100 > 90:
        df_pisa.drop(i, axis=1, inplace=True)
print(df_pisa.shape)
df_pisa.info()
(5233, 27)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Data columns (total 27 columns):
 #   Column                                                Non-Null Count  Dtype
---  ------                                                --------------  -----
 0   Country                                               5233 non-null   object
 1   Gender                                                5233 non-null   object
 2   Age                                                   5232 non-null   float64
 3   Grade                                                 5232 non-null   float64
 4   Average_math_literacy                                 5232 non-null   float64
 5   Average_reading_literacy                              5232 non-null   float64
 6   Average_science_literacy                              5232 non-null   float64
 7   Academic_performance                                  5232 non-null   float64
 8   Average_learning_time                                 2937 non-null   float64
 9   Class_repetition                                      4833 non-null   object
 10  Immigration_status                                    4812 non-null   object
 11  Perseverance_Give_up_easily                           2918 non-null   object
 12  Highest_educational_level_parents                     5225 non-null   object
 13  Highest_parental_education_years                      5225 non-null   float64
 14  Math_interest                                         2951 non-null   object
 15  School_does_little_to_prepare_me_for_life             2966 non-null   object
 16  Teacher_support                                       3021 non-null   float64
 17  Student_teacher_relation                              3001 non-null   float64
 18  Class_management_teacher_keep_class_orderly           2986 non-null   object
 19  Class_management_teacher_starts_on_time               2988 non-null   object
 20  Teacher_support_help_when_needed                      2988 non-null   object
 21  Teacher_support_help_learn                            2984 non-null   object
 22  Student_teacher_relation_teachers_are_interested      2973 non-null   object
 23  Student_teacher_relation_teachers_listen_to_students  2978 non-null   object
 24  Student_teacher_relation_teachers_help_students       2972 non-null   object
 25  Teacher_did_not_explain_well                          2914 non-null   object
 26  Teacher_did_not_get_students_interested               2906 non-null   object
dtypes: float64(10), object(17)
memory usage: 1.1+ MB
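The only column removed by the 90% rule is Highest_parental_occupational_status, which had 424 non-null values out of 5233 (about 91.9% missing). The same filter can also be written without an explicit loop. The sketch below is an alternative formulation on an invented frame, not the code used in this report.

```python
import numpy as np
import pandas as pd

# Invented frame with one column above and one below the 90% threshold
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],      # 95% missing
    "half_missing": [np.nan] * 10 + [1.0] * 10,   # 50% missing
    "complete": list(range(20)),                  # 0% missing
})

# isna().mean() gives the fraction of missing values per column
missing_pct = toy.isna().mean() * 100
to_drop = missing_pct[missing_pct > 90].index
trimmed = toy.drop(columns=to_drop)
print(list(to_drop), trimmed.shape)
```

Computing `missing_pct` first also leaves a per-column table that can be inspected before anything is dropped.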
df_pisa.Class_repetition = df_pisa.Class_repetition.str.replace('<',"");
df_pisa.Class_repetition = df_pisa.Class_repetition.str.strip('>');
df_pisa.Teacher_did_not_explain_well = df_pisa.Teacher_did_not_explain_well.str.replace('Very Likely', 'Very likely')
df_pisa.Teacher_did_not_get_students_interested = df_pisa.Teacher_did_not_get_students_interested.str.replace('Very Likely',
'Very likely')
df_pisa.Highest_educational_level_parents = df_pisa.Highest_educational_level_parents.str.replace('ISCED 3A, ISCED 4',
"ISCED 3A, 4");
df_pisa.Class_repetition.value_counts()
Did not repeat a grade    4601
Repeated a grade           232
Name: Class_repetition, dtype: int64
df_pisa.Teacher_did_not_explain_well.unique()
array(['Slightly likely', 'Likely', nan, 'Not at all likely',
'Very likely'], dtype=object)
df_pisa.Teacher_did_not_get_students_interested.unique()
array(['Likely', 'Slightly likely', 'Very likely', nan,
'Not at all likely'], dtype=object)
df_pisa.Highest_educational_level_parents.value_counts()
ISCED 3A, 4    2141
ISCED 5A, 6    1392
ISCED 2         780
None            368
ISCED 5B        341
ISCED 3B, C     152
ISCED 1          51
Name: Highest_educational_level_parents, dtype: int64
# Replace the missing value in the Age column with the minimum of that column
df_pisa['Age'] = df_pisa['Age'].fillna(df_pisa['Age'].min())
# Change the datatype from float to integer; astype(int) truncates, so 16.17 becomes 16 and 15.58 becomes 15
df_pisa['Age'] = df_pisa['Age'].astype(int)
# Replace the missing values in the Highest_parental_education_years column with the minimum of that column
df_pisa['Highest_parental_education_years'] = df_pisa['Highest_parental_education_years'].fillna(df_pisa['Highest_parental_education_years'].min())
# Change the datatype from float to integer (the values are already whole numbers of years)
df_pisa['Highest_parental_education_years'] = df_pisa['Highest_parental_education_years'].astype(int)
df_pisa['Age'].unique()
array([16, 15])
df_pisa['Highest_parental_education_years'].unique()
array([12, 16, 10, 3, 6, 15, 9, 5])
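Casting Age to integer reduces the thirteen distinct ages seen earlier to just 15 and 16, because `astype(int)` drops the fractional part. The sketch below uses a few of those ages plus one missing entry to compare truncation with rounding. Rounding is shown only for contrast and is not what this report does.

```python
import numpy as np
import pandas as pd

# A few of the Age values listed earlier, plus one missing entry
age = pd.Series([16.17, 15.58, 15.92, 16.33, np.nan])

filled = age.fillna(age.min())          # same fill rule as above: column minimum
truncated = filled.astype(int)          # drops the fractional part
rounded = filled.round().astype(int)    # nearest whole year instead
print(truncated.tolist(), rounded.tolist())
```

Whichever rule is chosen, the fine-grained age information is lost, so the float column is worth keeping if age in months matters for a later question.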
Save the gathered, assessed and cleaned PISA dataset (df_pisa) to a CSV file named "df_pisa_clean.csv".
# Saving the df_pisa to a new dataframe
df_pisa_clean = pd.DataFrame(df_pisa)
# Save the dataframe into a csv file format
df_pisa_clean.to_csv('df_pisa_clean.csv')
df_pisa_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Data columns (total 27 columns):
 #   Column                                                Non-Null Count  Dtype
---  ------                                                --------------  -----
 0   Country                                               5233 non-null   object
 1   Gender                                                5233 non-null   object
 2   Age                                                   5233 non-null   int32
 3   Grade                                                 5232 non-null   float64
 4   Average_math_literacy                                 5232 non-null   float64
 5   Average_reading_literacy                              5232 non-null   float64
 6   Average_science_literacy                              5232 non-null   float64
 7   Academic_performance                                  5232 non-null   float64
 8   Average_learning_time                                 2937 non-null   float64
 9   Class_repetition                                      4833 non-null   object
 10  Immigration_status                                    4812 non-null   object
 11  Perseverance_Give_up_easily                           2918 non-null   object
 12  Highest_educational_level_parents                     5225 non-null   object
 13  Highest_parental_education_years                      5233 non-null   int32
 14  Math_interest                                         2951 non-null   object
 15  School_does_little_to_prepare_me_for_life             2966 non-null   object
 16  Teacher_support                                       3021 non-null   float64
 17  Student_teacher_relation                              3001 non-null   float64
 18  Class_management_teacher_keep_class_orderly           2986 non-null   object
 19  Class_management_teacher_starts_on_time               2988 non-null   object
 20  Teacher_support_help_when_needed                      2988 non-null   object
 21  Teacher_support_help_learn                            2984 non-null   object
 22  Student_teacher_relation_teachers_are_interested      2973 non-null   object
 23  Student_teacher_relation_teachers_listen_to_students  2978 non-null   object
 24  Student_teacher_relation_teachers_help_students       2972 non-null   object
 25  Teacher_did_not_explain_well                          2914 non-null   object
 26  Teacher_did_not_get_students_interested               2906 non-null   object
dtypes: float64(8), int32(2), object(17)
memory usage: 1.0+ MB
# Check the shape of the data after selecting the relevant variables.
df_pisa_clean.shape
(5233, 27)
The PISA dataset surveys 5233 students, with 636 features describing each student's characteristics, personality, background and subject literacy. Some of the variables examined are categorical (qualitative) and some are numeric (quantitative). Most of the categorical variables are ordered factors rather than nominal, and most of the numeric variables are continuous rather than discrete. Because the dataset has so many variables (636), only the 27 relevant to the intended analysis were selected.
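Since most of the categorical variables are ordered, one option is to store that order explicitly so that sorting and plotting follow it. The sketch below is a possible approach rather than a step taken in this report, and the level order is an assumption inferred from the response labels seen during assessment.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Level order is an assumption based on the response labels seen earlier
likelihood = CategoricalDtype(
    ["Not at all likely", "Slightly likely", "Likely", "Very likely"],
    ordered=True,
)

# A handful of made-up answers, including one missing response
answers = pd.Series(["Likely", "Very likely", None, "Slightly likely"])
answers = answers.astype(likelihood)

print(answers.cat.codes.tolist())   # position of each answer in the order, -1 for missing
print(answers.min(), answers.max())
```

The same pattern would apply to the agree/disagree items, with their own level list.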
The main feature of interest in my dataset is the academic performance of students, which is an engineered variable. I am mostly interested in how different factors (variables) influence the students' academic performance.
I believe the engineered variable "average learning time" will have a strong influence on students' academic performance. Moreover, factors such as class repetition, class management, student-teacher relation, teacher support, attitude to school, teacher factors, perseverance (the tendency to give up easily), immigration status and the highest educational level of parents will support my investigation into the feature of interest.
In this section, the wrangled data will be analyzed and visualized. I will compute statistics and create visualizations, and I will pose research questions and answer them with the relevant statistics and plots. Before then, the df_pisa_clean.csv file needs to be loaded into a dataframe for further analysis.
# Loading the csv file to a dataframe
df_pisa_clean = pd.read_csv('df_pisa_clean.csv')
# Because the file was saved together with its index, reading it back adds an extra 'Unnamed: 0' column.
# Drop this unnamed index column using the code below.
df_pisa_clean.drop('Unnamed: 0',axis=1, inplace=True)
# Source: https://stackoverflow.com/questions/44620465/why-did-reset-indexdrop-true-function-unwantedly-remove-column
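The extra column appears because `to_csv` writes the row index by default. A small round trip on an invented frame, using an in-memory buffer instead of a file, shows how passing `index=False` when saving would avoid the later drop:

```python
import io
import pandas as pd

# Invented two-row frame standing in for the cleaned dataset
toy = pd.DataFrame({"Country": ["Albania", "United Arab Emirates"], "Age": [16, 15]})

# Default: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading the file with `index_col=0` is another way to reach the same result when the file has already been written with its index.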
df_pisa_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Data columns (total 27 columns):
 #   Column                                                Non-Null Count  Dtype
---  ------                                                --------------  -----
 0   Country                                               5233 non-null   object
 1   Gender                                                5233 non-null   object
 2   Age                                                   5233 non-null   int64
 3   Grade                                                 5232 non-null   float64
 4   Average_math_literacy                                 5232 non-null   float64
 5   Average_reading_literacy                              5232 non-null   float64
 6   Average_science_literacy                              5232 non-null   float64
 7   Academic_performance                                  5232 non-null   float64
 8   Average_learning_time                                 2937 non-null   float64
 9   Class_repetition                                      4833 non-null   object
 10  Immigration_status                                    4812 non-null   object
 11  Perseverance_Give_up_easily                           2918 non-null   object
 12  Highest_educational_level_parents                     5225 non-null   object
 13  Highest_parental_education_years                      5233 non-null   int64
 14  Math_interest                                         2951 non-null   object
 15  School_does_little_to_prepare_me_for_life             2966 non-null   object
 16  Teacher_support                                       3021 non-null   float64
 17  Student_teacher_relation                              3001 non-null   float64
 18  Class_management_teacher_keep_class_orderly           2986 non-null   object
 19  Class_management_teacher_starts_on_time               2988 non-null   object
 20  Teacher_support_help_when_needed                      2988 non-null   object
 21  Teacher_support_help_learn                            2984 non-null   object
 22  Student_teacher_relation_teachers_are_interested      2973 non-null   object
 23  Student_teacher_relation_teachers_listen_to_students  2978 non-null   object
 24  Student_teacher_relation_teachers_help_students       2972 non-null   object
 25  Teacher_did_not_explain_well                          2914 non-null   object
 26  Teacher_did_not_get_students_interested               2906 non-null   object
dtypes: float64(8), int64(2), object(17)
memory usage: 1.1+ MB
Here, individual variables in the df_pisa_clean dataset will be investigated, and the trend in each variable examined will be visualized via plots.
# Since I will be looking at many categorical variables, it is expedient to define a reusable plotting function.
base_color = sb.color_palette()[2]
def univariate_plot(var, sort=None):
    # Print the value counts, then draw a count plot of the variable
    print(df_pisa_clean[var].value_counts())
    # Order the bars by frequency when sort is set; otherwise keep seaborn's default order
    sort_order = df_pisa_clean[var].value_counts().index if sort else None
    sb.countplot(data=df_pisa_clean, x=var, color=base_color, order=sort_order)
    plt.xlabel(f"Student's {var}", fontsize=12)
    plt.ylabel('Number of Students', fontsize=12)
    plt.title(f"Student's Distribution by {var}")
plt.figure(figsize=(10,6))
plt.subplot(1, 2, 1)  # 1 row, 2 columns, panel 1
univariate_plot('Country', sort=True)
plt.subplot(1, 2, 2)  # 1 row, 2 columns, panel 2
colors = ['tab:orange', 'tab:cyan']
sorted_counts = df_pisa_clean.Country.value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, autopct='%1.1f%%', explode=[0, 0.3],
startangle = 90, colors=colors, counterclock = False, labeldistance=None);
plt.axis('square');
plt.title("Student's Distribution by Country", pad=10)
plt.legend(title= 'Country', bbox_to_anchor=(1.3, 0.9), loc='upper right', borderaxespad=0.4);
#Source: https://www.geeksforgeeks.org/how-to-adjust-title-position-in-matplotlib/
# https://mldoodles.com/matplotlib-pie-chart/
Albania 4743 United Arab Emirates 490 Name: Country, dtype: int64
- In the PISA dataset provided, the Country variable, which records each student's country of residence, was investigated. A count plot was used to show which country contributed the most students.
- Far more of the students who took part in the assessment came from Albania (4743) than from the United Arab Emirates (490). The effect of country on the overall performance of students is investigated later in this report.
plt.figure(figsize=(8,5))
univariate_plot('Gender', sort=True);
Female 2676 Male 2557 Name: Gender, dtype: int64
- In the PISA dataset provided, the Gender variable, which records the gender of each student, was investigated.
- The numbers of male and female students who took part in the assessment are similar: female students (2676) slightly outnumber male students (2557).
plt.figure(figsize=(8,5))
univariate_plot('Immigration_status', sort=True);
Native 4486 First-Generation 230 Second-Generation 96 Name: Immigration_status, dtype: int64
Immigration Status: The migration background of a native-born adult is based on the country of birth of his/her parents. Thus, if neither parent is foreign-born, the native-born adult has native origins. Immigrant students are defined here as those who have at least one foreign-born parent. First-generation immigrant students are those who were born outside of a particular country, and second-generation immigrants are those who were born within that particular country or its territories. The analysis covers five immigrant populations :
- The Immigration_status variable of the students who took part in the survey was investigated.
- From the statistics and plot above, the large majority of students with a recorded status are natives (93.2%), followed by first-generation (4.8%) and second-generation immigrants (2.0%).
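The percentages quoted above can be reproduced from the printed counts. This is a minimal sketch that assumes only the three counts shown in the output; the `counts` Series is rebuilt by hand here rather than read from `df_pisa_clean`.

```python
import pandas as pd

# Counts copied from the value_counts() output above (missing values are already excluded)
counts = pd.Series(
    {"Native": 4486, "First-Generation": 230, "Second-Generation": 96},
    name="Immigration_status",
)

# Share of each group among students with a recorded immigration status
shares = (100 * counts / counts.sum()).round(1)
print(shares)
```

On the real data frame, `df_pisa_clean['Immigration_status'].value_counts(normalize=True)` should give the same proportions directly.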
plt.figure(figsize=(6,5))
univariate_plot('Class_repetition',sort=True)
# Calculate the class_repeat_counts just to have clarity.
class_repeat_counts = df_pisa_clean['Class_repetition'].value_counts()
Total_repeat_or_not = df_pisa_clean['Class_repetition'].value_counts().sum()
# Get the current tick locations and labels
locs, labels = plt.xticks(rotation=9)
# Loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # Use the tick label text to look up the matching count
    count = class_repeat_counts[label.get_text()]
    pct_string = '{:0.1f}%'.format(100*count/Total_repeat_or_not)
    # Print the annotation just above the top of the bar
    plt.text(loc, count+2, pct_string, ha = 'center', color = 'black')
Did not repeat a grade 4601 Repeated a grade 232 Name: Class_repetition, dtype: int64
Class repetition (grade repetition) is the practice of holding back students who failed to master the curriculum or meet the promotion criteria, so that they do not move on to the next grade.
- For the Class_repetition variable, the analysis reveals that 95.2% of the students who answered did not repeat a grade, while 4.8% repeated a grade.
- Grade repetition is therefore uncommon in this sample. The counts alone do not show why a student repeated a grade.
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
univariate_plot('Teacher_did_not_explain_well', sort=True);
plt.subplot(1,2,2)
univariate_plot('Teacher_did_not_get_students_interested', sort=True);
Slightly likely 1121 Not at all likely 805 Likely 715 Very likely 273 Name: Teacher_did_not_explain_well, dtype: int64 Slightly likely 1175 Not at all likely 711 Likely 706 Very likely 314 Name: Teacher_did_not_get_students_interested, dtype: int64
The "Teacher did not explain well" and "Teacher did not get students interested" variables both belong to the students' attribution-to-failure factor.
From the bar plots above, the most common response to both items is "Slightly likely": students most often attribute failure only slightly to teachers not explaining well or not getting them interested in the topics being taught.
NOTE:
# Plot the three Student-Teacher relation items together to see the distribution of each ordinal variable.
fig, ax = plt.subplots(nrows=3, figsize = [8,11])
default_color = sb.color_palette()[2]
# Dynamic-ordering the bars
# Count the frequency of each unique value in the column of interest, sort it in descending order and return a series
gen_order1 = df_pisa_clean['Student_teacher_relation_teachers_are_interested'].value_counts().index
gen_order2 = df_pisa_clean['Student_teacher_relation_teachers_listen_to_students'].value_counts().index
gen_order3 = df_pisa_clean['Student_teacher_relation_teachers_help_students'].value_counts().index
sb.countplot(data = df_pisa_clean, x = 'Student_teacher_relation_teachers_are_interested', order=gen_order1,
color = default_color, ax = ax[0])
sb.countplot(data = df_pisa_clean, x = 'Student_teacher_relation_teachers_listen_to_students',order=gen_order2,
color = default_color, ax = ax[1])
sb.countplot(data = df_pisa_clean, x = 'Student_teacher_relation_teachers_help_students',order=gen_order3,
color = default_color, ax = ax[2])
plt.show()
- Across the student-teacher relationship items examined, the plots show that most students report a good relationship with their teachers. In all three plots above, most students agreed that teachers are interested in them, listen to them, and help them whenever the need arises.
- It is therefore recommended that teachers continue to create an enabling environment that gives students the opportunity to understand the concepts being taught, by developing a good relationship with their students.
# For the numeric form of the student-teacher relationship, I employed pandas cut function.
# This is used to segment and sort data values into bins.
# This function is also useful for going from a continuous variable to a categorical variable
df_pisa_clean['Student_teacher_relation'] = pd.cut(df_pisa_clean["Student_teacher_relation"], bins = 4,
labels = ["Strongly disagree","Disagree", "Agree", "Strongly agree"])
univariate_plot('Student_teacher_relation', sort=True);
Agree 1299 Strongly agree 1193 Disagree 495 Strongly disagree 14 Name: Student_teacher_relation, dtype: int64
- The Student_teacher_relation variable, once transformed into a categorical variable, gives a similar picture to the individual student-teacher relationship items explored above. Most students agreed that there is a good relationship between students and teachers, which is consistent with the earlier plots. The order is:
- Student-Teacher relation = Agree > Strongly agree > Disagree > Strongly disagree.
- For subsequent analysis, this transformed Student_teacher_relation variable will be used instead of the individual items describing the kind of relationship, unless specified otherwise.
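Because `pd.cut` with an integer `bins` splits the observed range into equal-width intervals, the four labels mark equal slices of the index rather than equal-sized groups of students. The sketch below illustrates this on made-up values; the numbers in `scores` are hypothetical and are not taken from the PISA data.

```python
import pandas as pd

# Hypothetical index values, for illustration only
scores = pd.Series([-3.0, -1.2, -0.4, 0.1, 0.3, 0.8, 1.5, 2.9])

labels = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]

# bins=4 -> four equal-width intervals between min and max; retbins also returns the edges
binned, edges = pd.cut(scores, bins=4, labels=labels, retbins=True)

print(edges)                  # five edges, roughly (max - min) / 4 apart
print(binned.value_counts())  # number of values that fall under each label
```

Whether these equal-width slices correspond to the original response categories depends on how the index was constructed, so the labels are best read as a convenient approximation.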
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
univariate_plot('Teacher_support_help_when_needed', sort=True);
plt.subplot(1,2,2)
univariate_plot('Teacher_support_help_learn', sort=True);
Strongly agree 1562 Agree 1241 Disagree 159 Strongly disagree 26 Name: Teacher_support_help_when_needed, dtype: int64 Strongly agree 1751 Agree 1103 Disagree 101 Strongly disagree 29 Name: Teacher_support_help_learn, dtype: int64
# For the numeric form of the Teacher-support variable, I employed pandas cut function as described previously.
df_pisa_clean['Teacher_support'] = pd.cut(df_pisa_clean["Teacher_support"], bins = 4,
labels = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"])
univariate_plot('Teacher_support', sort=True);
Strongly agree 1616 Agree 1177 Disagree 205 Strongly disagree 23 Name: Teacher_support, dtype: int64
- The Teacher_support variable, once transformed into a categorical variable, gives a similar picture to the individual teacher-support items investigated above. Most students agreed that their teachers support them in achieving their academic goals. The order is:
Teacher_support = Strongly agree > Agree > Disagree > Strongly disagree.
For subsequent analysis, this transformed Teacher_support variable will be used instead of the individual items describing the kind of support teachers give, unless specified otherwise.
Therefore, to answer the question: according to the students' survey responses, teacher support is an important variable in the PISA assessment.
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
univariate_plot('Class_management_teacher_starts_on_time', sort=True);
plt.subplot(1,2,2)
univariate_plot('Class_management_teacher_keep_class_orderly', sort=True);
Strongly agree 1826 Agree 980 Disagree 151 Strongly disagree 31 Name: Class_management_teacher_starts_on_time, dtype: int64 Strongly agree 1627 Agree 1173 Disagree 153 Strongly disagree 33 Name: Class_management_teacher_keep_class_orderly, dtype: int64
- The Class management variable in the assessment was investigated. Most of the students strongly agreed that one of the ways the teacher manages the classroom is to start teaching on time. Also, they strongly agreed that the teacher manages the classroom well by keeping the class orderly.
univariate_plot('Math_interest', sort=True);
Agree 1484 Strongly agree 1060 Disagree 328 Strongly disagree 79 Name: Math_interest, dtype: int64
- The variable Math_interest was investigated. It is commonly believed that most students show little interest in mathematics because of the amount of calculation and skill involved. A student with a positive attitude towards mathematics therefore studies the subject because he/she likes it or takes pleasure in it.
- Interestingly, the analysis shows that most students who took part in the assessment agreed or strongly agreed that they are interested in mathematics.
univariate_plot('School_does_little_to_prepare_me_for_life', sort=True);
Disagree 1235 Strongly disagree 664 Agree 578 Strongly agree 489 Name: School_does_little_to_prepare_me_for_life, dtype: int64
- Schools focus on academic knowledge and teach students to memorize information, which gives them little chance to learn critical life skills. Schools focus on preparing students for university, but not for jobs and real life. They do not teach students how to manage money, how to negotiate, or how to communicate.
- According to a survey by the Association of American Colleges and Universities published in 2015, only 55 percent of high school students feel prepared to enter the real world.
- Although schools do expose students to valuable skills such as perseverance, responsibility, and social skills, they do not cover the skills used in day-to-day life. It has been argued that students who have just graduated from high school lack the skills needed to live in the real world. The variable School_does_little_to_prepare_me_for_life was therefore explored.
- Contrary to these reports in the literature, the analysis shows that most students disagree with the statement that school does little to prepare them for life. Most of the students are of the opinion that school plays a major role in preparing them for life.
# Calculate the perseverance_counts and sum just to have clarity.
persev_counts = df_pisa_clean['Perseverance_Give_up_easily'].value_counts()
persev_order = df_pisa_clean['Perseverance_Give_up_easily'].value_counts().index
# Returns the sum of all not-null values in `Perseverance_Give_up_easily` column
persev_sum = df_pisa_clean['Perseverance_Give_up_easily'].value_counts().sum()
print(df_pisa_clean.Perseverance_Give_up_easily.value_counts())
plt.figure(figsize=(8,7))
sb.countplot(data=df_pisa_clean, y='Perseverance_Give_up_easily', color=base_color, order=persev_order);
plt.xlabel("Number of Students", fontsize =12)
plt.ylabel("Student's Perseverance level (Give up easily)", fontsize =12)
plt.title("Student's Distribution by Perseverance level (Give up easily)", fontsize =12)
# Logic to print the proportion text on the bars
for i in range(persev_counts.shape[0]):
    # persev_counts holds the frequency of each unique value in the `Perseverance_Give_up_easily` column, in decreasing order.
    # Use .iloc for positional access, since the index holds the response labels.
    count = persev_counts.iloc[i]
    # Convert count into a percentage, and then into string
    pct_string = '{:0.1f}%'.format(100*count/persev_sum)
    # Print the string value on the bar.
    plt.text(count+1, i, pct_string, va='center')
Not at all like me 1108 Not much like me 707 Somewhat like me 399 Mostly like me 369 Very much like me 335 Name: Perseverance_Give_up_easily, dtype: int64
- For the Perseverance_Give_up_easily variable, the statistics and the visualization show that most students tend to persevere: they do not give up easily in the course of their studies.
- The responses follow the order: Not at all like me > Not much like me > Somewhat like me > Mostly like me > Very much like me.
- This implies that the students who took part in the assessment are more often determined to work through the problems and challenges that arise in their studies than inclined to give up easily.
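Sorting the bars by frequency happens to reproduce the order of the response scale here, but that depends on the counts. One way to keep a response scale in its natural order regardless of frequency is an ordered categorical. The sketch below is not part of the original analysis; the counts are copied from the output above and the responses are rebuilt by hand.

```python
import pandas as pd

# Response scale from least to most "gives up easily"
scale = ["Not at all like me", "Not much like me", "Somewhat like me",
         "Mostly like me", "Very much like me"]

# Counts copied from the value_counts() output above
counts = pd.Series([1108, 707, 399, 369, 335], index=scale)

# Rebuild one response per student, then declare the scale as ordered
responses = pd.Series(counts.index.repeat(counts.values))
responses = responses.astype(pd.CategoricalDtype(categories=scale, ordered=True))

# sort_index() follows the declared category order, not the counts
print(responses.value_counts().sort_index())
```

With the column stored this way, the same `scale` list can be passed as `order=scale` to `sb.countplot` to fix the bar order.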
parent_edu_counts = df_pisa_clean['Highest_educational_level_parents'].value_counts()
parent_edu_order = df_pisa_clean['Highest_educational_level_parents'].value_counts().index
# Returns the sum of all not-null values in `Highest_educational_level_parents` column
parent_edu_sum = df_pisa_clean['Highest_educational_level_parents'].value_counts().sum()
print(df_pisa_clean.Highest_educational_level_parents.value_counts())
plt.figure(figsize=(8,7))
sb.countplot(data=df_pisa_clean, y='Highest_educational_level_parents', color=base_color, order=parent_edu_order);
plt.xlabel("Number of Students", fontsize =12)
plt.ylabel("Parent's Highest Educational level", fontsize =12)
plt.title("Student's Distribution by Parent's Highest Educational level", fontsize =12)
# Logic to print the proportion text on the bars
for i in range(parent_edu_counts.shape[0]):
    # parent_edu_counts holds the frequency of each unique value in the `Highest_educational_level_parents` column, in decreasing order.
    # Use .iloc for positional access, since the index holds the ISCED labels.
    count = parent_edu_counts.iloc[i]
    # Convert count into a percentage, and then into string
    pct_string = '{:0.1f}%'.format(100*count/parent_edu_sum)
    # Print the string value on the bar.
    plt.text(count+1, i, pct_string, va='center')
ISCED 3A, 4 2141 ISCED 5A, 6 1392 ISCED 2 780 None 368 ISCED 5B 341 ISCED 3B, C 152 ISCED 1 51 Name: Highest_educational_level_parents, dtype: int64
International Standard Classification of Education (ISCED): the reference international classification for organising education programmes and related qualifications by levels and fields.
- ISCED 3A: Programmes designed to provide direct access to ISCED 5A.
- ISCED 3B: Programmes designed to provide direct access to ISCED 5B.
- ISCED 3C: Programmes not designed to lead to ISCED 5A or 5B.
- ISCED 5A: Programmes that are largely theoretically based and are intended to provide sufficient qualifications for gaining entry into advanced research programmes and professions with high skills requirements.
- ISCED 5B: Programmes that are practically oriented or occupationally specific and are mainly designed for participants to acquire the practical skills and know-how needed for employment in a particular occupation or trade, or class of occupations or trades. Successful completion usually provides the participants with a labour-market relevant qualification.
- ISCED 6: Second stage of tertiary education (leading to an advanced research qualification).
To understand more about the ISCED, click here
- Both the statistics and the plot show that the most common category for parents' highest educational level is ISCED 3A, 4. One might have presumed that students whose parents reached ISCED 6 would take part in the survey more than students whose parents have lower qualifications. The statistics and plot above do not support this presumption. The categories follow the order: ISCED 3A, 4 > ISCED 5A, 6 > ISCED 2 > None > ISCED 5B > ISCED 3B, C > ISCED 1.
Secondly, let us look at visualization of individual numerical/quantitative variables
Since Age and Grade are discrete, a bar plot can be used to examine the distribution of each, starting with the Age of the students.
plt.figure(figsize=[8,5])
univariate_plot('Age', sort=True);
15 3428 16 1805 Name: Age, dtype: int64
- Most of the students who took part in the survey are 15 years old.
plt.figure(figsize=[8,6])
univariate_plot('Grade');
0.0 3191 -1.0 1803 -2.0 119 1.0 105 -3.0 12 2.0 2 Name: Grade, dtype: int64
The relative grade index indicates whether students are at the modal grade in a country (value of 0), or whether they are below or above the modal grade level (+ x grades, -x grades). Find out more about the modal grade in a country here
- Grade is the variable analyzed here. It is expressed relative to the modal grade in the student's country.
- The analysis shows that more students are at the modal grade in their country than below or above it.
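To make the idea of a relative grade index concrete, the sketch below derives one from raw grades. It is an illustration only: the countries, grades and column names are hypothetical, and PISA's own derivation of the index may differ.

```python
import pandas as pd

# Hypothetical raw grades for two made-up countries, for illustration only
df = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B"],
    "grade":   [10, 10, 9, 11, 9, 9, 8],
})

# Modal (most common) grade within each country, broadcast back to every row
modal_grade = df.groupby("country")["grade"].transform(lambda g: g.mode().iloc[0])

# 0 = at the modal grade, negative = below it, positive = above it
df["relative_grade"] = df["grade"] - modal_grade
print(df)
```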
# Create a helper function for the other numeric variables.
# Note: the right-hand histogram reads the global variable `bins`, which must be set before each call.
def univariate_num(var):
    print(df_pisa_clean[var].describe())
    plt.figure(figsize=[15,7])
    # HISTOGRAM ON LEFT: full data with matplotlib's default bins
    plt.subplot(1, 2, 1)
    plt.hist(data=df_pisa_clean, x=var, color='green')
    plt.xlabel(var, fontsize=12)
    plt.ylabel('Number of Students', fontsize=12)
    plt.title(f"Student's Distribution by {var}")
    # HISTOGRAM ON RIGHT: full data with the custom bin edges in `bins`
    plt.subplot(1, 2, 2)
    plt.hist(data=df_pisa_clean, x=var, color='blue', bins=bins)
    plt.xlabel(var, fontsize=12)
    plt.ylabel('Number of Students', fontsize=12)
    plt.title(f"Student's Distribution by {var}")
bins = np.arange(df_pisa_clean['Average_math_literacy'].min(), df_pisa_clean['Average_math_literacy'].max()+10, 15)
univariate_num('Average_math_literacy')
count 5232.000000 mean 399.899489 std 88.897515 min 101.347560 25% 343.499770 50% 399.719590 75% 456.543080 max 688.666440 Name: Average_math_literacy, dtype: float64
The engineered variable Average_math_literacy is analyzed here. The left plot is the histogram with default bins. A bimodal distribution is observed (two peaks), and the direct adjacency of the bars emphasizes that the data take on a continuous range of values. With a bin width of 15, the histogram on the right looks roughly bell-shaped, although the two peaks are still evident. Both peaks lie between 390 and 420. The minimum mathematics literacy score is about 101 and the maximum is about 689.
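The right-hand histogram takes its bin edges from `np.arange(min, max + 10, 15)`. With a step of 15 and only 10 added to the maximum, the last edge can fall short of the largest value, in which case `plt.hist` leaves that observation out of the plot. The quick check below uses the minimum and maximum printed above; the `safe_edges` variant is a suggestion, not the code used in this report.

```python
import numpy as np

# Minimum and maximum copied from the describe() output above
lo, hi = 101.347560, 688.666440
width = 15

# Edges as built in the call above: stop at max + 10, step 15
edges = np.arange(lo, hi + 10, width)

# Variant that always reaches the maximum: extend the stop by one full bin width
safe_edges = np.arange(lo, hi + width, width)

# Does the last edge reach the maximum in each case?
print(edges[-1], edges[-1] >= hi)
print(safe_edges[-1], safe_edges[-1] >= hi)
```

Whether the shortfall occurs depends on the range of each variable, so it is worth checking per column.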
bins = np.arange(df_pisa_clean['Average_reading_literacy'].min(), df_pisa_clean['Average_reading_literacy'].max()+10, 15);
univariate_num('Average_reading_literacy');
count 5232.000000 mean 401.605365 std 106.738703 min 23.722020 25% 337.277465 50% 408.211890 75% 474.934920 max 743.001640 Name: Average_reading_literacy, dtype: float64
The engineered variable Average_reading_literacy is analyzed here. The left plot is the histogram with default bins. A unimodal distribution is observed (unlike Average_math_literacy). With a bin width of 15, the histogram on the right looks roughly bell-shaped with a distinct peak between 430 and 440. The minimum reading literacy score is about 24 and the maximum is about 743. The few values above 700 may be regarded as outliers.
bins = np.arange(df_pisa_clean['Average_science_literacy'].min(), df_pisa_clean['Average_science_literacy'].max()+10, 15)
univariate_num('Average_science_literacy')
count 5232.000000 mean 405.439042 std 93.609631 min 89.276400 25% 349.254020 50% 408.327110 75% 466.001505 max 716.840720 Name: Average_science_literacy, dtype: float64
The engineered variable Average_science_literacy is analyzed here. The left plot, the histogram with default bins, resembles that of Average_math_literacy: a bimodal distribution is observed. With a bin width of 15, the histogram on the right looks roughly bell-shaped with a distinct peak between 370 and 400. The minimum science literacy score is about 89 and the maximum is about 717.
bins = np.arange(df_pisa_clean['Academic_performance'].min(), df_pisa_clean['Academic_performance'].max()+10, 15)
univariate_num('Academic_performance')
count 5232.000000 mean 402.314632 std 92.082094 min 88.394973 25% 347.176710 50% 405.035353 75% 463.037698 max 706.173240 Name: Academic_performance, dtype: float64
Academic performance is the measurement of student achievement across various academic subjects.
The engineered variable Academic_performance is analyzed here. The left plot is the histogram with default bins. A bimodal distribution is observed (two peaks), and the direct adjacency of the bars emphasizes that the data take on a continuous range of values. With a bin width of 15, the histogram on the right looks roughly bell-shaped, although with stepwise peaks. The highest peak is between 420 and 440. In the subsequent bivariate exploration, Academic_performance serves as the dependent variable.
df_pisa_clean['Average_learning_time'].describe().reset_index()
| | index | Average_learning_time |
|---|---|---|
| 0 | count | 2937.000000 |
| 1 | mean | 176.640393 |
| 2 | std | 66.493597 |
| 3 | min | 0.000000 |
| 4 | 25% | 135.000000 |
| 5 | 50% | 165.000000 |
| 6 | 75% | 210.000000 |
| 7 | max | 980.000000 |
bins = np.arange(df_pisa_clean['Average_learning_time'].min(), df_pisa_clean['Average_learning_time'].max()+10, 15)
univariate_num('Average_learning_time')
count 2937.000000 mean 176.640393 std 66.493597 min 0.000000 25% 135.000000 50% 165.000000 75% 210.000000 max 980.000000 Name: Average_learning_time, dtype: float64
plt.figure(figsize=[10,5])
sb.violinplot(x=df_pisa_clean['Average_learning_time'], color='green');
plt.xlabel('Average_learning_time (mins/week)', fontsize = 12)
plt.title("Student's distribution by Average Learning Time (mins/week)");
In the two histograms of average learning time above, the distribution is skewed to the right, suggesting that there may be outliers. A violin plot was therefore used to check for them. As expected, the violin plot shows points far out on the right: the long green tail beyond the upper adjacent value (the end of the black line extending from the bar), at about 320 mins/week and above. These outliers need to be removed, and the code below was used for that purpose.
cols = ['Average_learning_time'] # The column(s) you want to search for outliers in
# Calculate quantiles and IQR
Q1 = df_pisa_clean[cols].quantile(0.25) # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = df_pisa_clean[cols].quantile(0.75)
IQR = Q3 - Q1
# Return a boolean array of the rows with (any) non-outlier column values
condition = ~((df_pisa_clean[cols] < (Q1 - 1.5 * IQR)) | (df_pisa_clean[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
# Filter our dataframe based on condition
df_pisa_clean = df_pisa_clean[condition]
df_pisa_clean['Average_learning_time'].describe().reset_index()
| | index | Average_learning_time |
|---|---|---|
| 0 | count | 2838.000000 |
| 1 | mean | 168.552972 |
| 2 | std | 47.190009 |
| 3 | min | 40.000000 |
| 4 | 25% | 135.000000 |
| 5 | 50% | 165.000000 |
| 6 | 75% | 195.000000 |
| 7 | max | 320.000000 |
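The cutoffs applied by the IQR filter can be checked by hand from the summary statistics printed before cleaning. The sketch below recomputes the two fences from the quartiles and then uses a few toy values (not PISA data) to show how the same boolean condition treats missing values.

```python
import numpy as np
import pandas as pd

# Quartiles copied from the describe() output before cleaning
q1, q3 = 135.0, 210.0
iqr = q3 - q1

# Tukey fences: values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(lower, upper)

# Toy values showing that NaN is kept by the filter:
# both comparisons evaluate to False for NaN, so the negated mask is True
toy = pd.Series([10.0, 150.0, 400.0, np.nan])
keep = ~((toy < lower) | (toy > upper))
print(toy[keep])
```

The fences of 22.5 and 322.5 mins/week are consistent with the minimum of 40 and maximum of 320 after cleaning. Because rows with a missing learning time pass the condition, the filtered data frame keeps them, which is why later summaries of other columns report 5134 rows rather than 2838.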
plt.figure(figsize=[16,7])
# VIOLIN PLOT ON LEFT
plt.subplot(1, 2, 1)
sb.violinplot(y=df_pisa_clean['Average_learning_time'], color='green');
# The variable is on the y-axis; the width of the violin shows the estimated density
plt.ylabel('Average_learning_time (mins/week)', fontsize = 12)
plt.title("Plt 1A: Student's distribution by Average Learning Time (mins/week)");
# HISTOGRAM ON RIGHT: full data with scaling
plt.subplot(1, 2, 2)
bins = np.arange(df_pisa_clean['Average_learning_time'].min(), df_pisa_clean['Average_learning_time'].max()+10, 15)
plt.hist(data= df_pisa_clean, x= 'Average_learning_time', color = 'blue', bins=bins);
plt.xlabel('Average_learning_time (mins/week)', fontsize = 12)
plt.ylabel('Number of Students',fontsize = 12)
plt.title("Plt 1B: Student's distribution by Average Learning Time (mins/week)")
plt.xlim(38,330)
(38.0, 330.0)
- Another engineered variable, Average learning time, is analyzed here. The summary statistics above show that the outliers have been removed: the number of non-missing values drops from 2937 before cleaning to 2838 after.
- In the plot on the left (Plot 1A), the violin plot shows that the outliers have been taken care of, in contrast to the plot before cleaning. It also shows two modes. To obtain more information on the variable, a histogram was plotted. With explicit bin edges, Plot 1B confirms what was observed in the violin plot: there are two peaks, the first between 120 and 130 mins and the second between 170 and 180 mins.
bins = np.arange(2,df_pisa_clean['Highest_parental_education_years'].max()+2, 2.5);
univariate_num('Highest_parental_education_years');
print('\n')
print(df_pisa_clean.Highest_parental_education_years.value_counts())
count 5134.000000 mean 12.273861 std 3.459739 min 3.000000 25% 12.000000 50% 12.000000 75% 16.000000 max 16.000000 Name: Highest_parental_education_years, dtype: float64 12 2271 16 1635 10 752 3 375 6 42 15 33 9 20 5 6 Name: Highest_parental_education_years, dtype: int64
- The variable analyzed here is the highest parental education in years. In the dual plot above, the default-bin plot on the left did not represent the variable well, so I set the bin size to produce the plot on the right. The distribution is skewed to the left, which means that most parents spent a relatively large number of years in education.
- The distribution shows that the parents of most students who took the assessment spent 12 years or more in education.
- This suggests that most parents of the students who took the survey had finished secondary school. This is consistent with the findings for the highest educational level of parents, where the most common category is ISCED 3A, 4 (upper secondary or post-secondary non-tertiary education).
The variables of interest are Academic performance and Average learning time. The full data for each variable were first plotted as a histogram with default bins. The histogram of academic performance appears skewed to the left, while that of average learning time appears skewed to the right. So I applied axis limits and set the bin edges explicitly to obtain clearer plots. When I tried a log transformation on the academic performance variable, the resulting plot was less evenly distributed than the one obtained with axis limits. With axis limits and explicit bins, the academic performance variable shows a stepwise modal distribution whose highest peak is between 420 and 440. However, when similar bin edges were set for the average learning time data, the plot was still not well distributed because of outliers visible in the histogram. Therefore, steps were taken to remove the outliers.
Yes, some of the features investigated had unusual distributions. In the average learning time variable, outliers were identified when the histogram and violin plot were drawn. The violin plot showed several extreme data points at about 320 mins/week and above, beyond the upper adjacent value on the right (outliers). Therefore, the outliers were removed before further plots were made. Code adapted from the sources linked here and here was used to remove them. After cleaning and setting axis limits, the violin plot and the histogram were redrawn to verify that the outliers had been removed. The result is a clean bimodal distribution in both the violin plot and the histogram.
In this section, investigation of the relationships between pairs of variables in the df_pisa_clean dataset will be made. It should be noted that the dependent variable will be the Academic_performance.
Herein, I want to look at the relationship between numeric variables to inspect how they correlate with one another. To achieve this, I will employ the heatmap to examine the degree of correlation among the numeric variables investigated in this dataset.
numeric_variables = ['Average_math_literacy','Average_reading_literacy','Average_science_literacy','Academic_performance',
'Average_learning_time','Age', 'Highest_parental_education_years', 'Grade']
plt.figure(figsize=(10,6))
sb.heatmap(df_pisa_clean[numeric_variables].corr(), cmap='rocket_r', annot=True, fmt='.2f', vmin=0, center=0);
print('The correlation between Average learning time and Academic performance is',
df_pisa_clean['Average_learning_time'].corr(df_pisa_clean['Academic_performance']).round(2))
The correlation between Average learning time and Academic performance is 0.11
plt.figure(figsize = [18, 8])
plt.subplot(1, 2, 1)
sb.regplot(data = df_pisa_clean, x = 'Average_learning_time', y = 'Academic_performance', truncate=False,
x_jitter=0.3, scatter_kws={'alpha':1/15});
plt.xlabel('Average learning time (minutes/week)', fontsize=12);
plt.ylabel('Academic_performance', fontsize=12);
plt.title("Student's Average learning time vs Academic performance");
plt.subplot(1, 2, 2)
bins_x = np.arange(30, 300+25, 25)
bins_y = np.arange(150, 700, 50)
cdv = df_pisa_clean.dropna(subset=['Average_learning_time', 'Academic_performance']).reset_index()
plt.hist2d(data = cdv, x = 'Average_learning_time', y = 'Academic_performance',
cmin=0.5, cmap='viridis_r', bins = [bins_x, bins_y])
plt.colorbar()
plt.xlabel('Average learning time (minutes/week)', fontsize=12);
plt.ylabel('Academic_performance', fontsize=12);
plt.title("Student's Average learning time vs Academic performance");
- There is a common notion that the more time a student spends learning, the better the student's academic performance. So, let's see if this applies here.
- These plots analyze the relationship between academic performance and average learning time. Because both variables are numeric, a scatter plot with a regression line and a 2D histogram (hist2d) were used. The correlation between average learning time and academic performance is positive (0.11) but very weak. The hist2d plot shows that at about 125 mins/week students tend to score a little below 400, while at about 163 mins/week students tend to score above 400.
- This suggests that students who spend more time learning tend to perform slightly better academically, although such a weak correlation means that learning time on its own accounts for little of the variation in performance.
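To put the reported correlation of 0.11 in perspective, the square of a Pearson correlation gives the share of variance that a straight-line fit on one variable accounts for in the other. The helper name `variance_explained` is hypothetical; only the value 0.11 comes from the output above.

```python
def variance_explained(r):
    """Share of variance accounted for by a simple linear fit, given Pearson r."""
    return r ** 2

r = 0.11  # correlation between Average_learning_time and Academic_performance reported above
print(f"r = {r:.2f}, r^2 = {variance_explained(r):.4f}")
```

An r² of roughly 0.012 means that about 1% of the variation in academic performance lines up linearly with learning time, which is why the trend in the scatter plot is so faint.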
# For subsequent relationships between two numeric variables, the function below is used to avoid code repetition.
def bivar_numeric(var, num):
    # Print the Pearson correlation, then draw a scatter plot with a regression line.
    print(f'The correlation between {var} and {num} is',
          df_pisa_clean[num].corr(df_pisa_clean[var]).round(2))
    plt.figure(figsize=[10,8])
    sb.regplot(data = df_pisa_clean, x = num, y = var, truncate=False,
               x_jitter=0.2, scatter_kws={'alpha':1/4});
    plt.xlabel(num, fontsize=12);
    plt.ylabel(var, fontsize=12);
    plt.title(f"Student's Distribution by {var} and {num}");
bivar_numeric("Academic_performance", "Average_math_literacy")
The correlation between Academic_performance and Average_math_literacy is 0.95
Yes, there is a very strong relationship between students' academic performance and their average math literacy. The scatter plot of the two numeric variables, Academic_performance and Average_math_literacy, shows a very strong positive correlation, with a correlation value of 0.95. This indicates that as a student's Average_math_literacy score increases, the student's academic performance increases as well.
bivar_numeric("Academic_performance","Average_science_literacy")
The correlation between Academic_performance and Average_science_literacy is 0.96
bivar_numeric("Academic_performance","Average_reading_literacy")
The correlation between Academic_performance and Average_reading_literacy is 0.95
bivar_numeric("Average_math_literacy","Average_science_literacy")
The correlation between Average_math_literacy and Average_science_literacy is 0.88
bivar_numeric("Average_math_literacy","Average_reading_literacy")
The correlation between Average_math_literacy and Average_reading_literacy is 0.86
bivar_numeric("Average_science_literacy","Average_reading_literacy")
The correlation between Average_science_literacy and Average_reading_literacy is 0.86
The relationship between students' achievement in reading literacy and in science literacy is positive, with a correlation value of 0.86 between Average_reading_literacy and Average_science_literacy. Students with higher reading literacy scores also tend to have higher science literacy scores.
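The correlations of 0.95 to 0.96 between Academic_performance and each literacy score may be partly a matter of construction. The sketch below assumes, and this is an assumption because the construction of Academic_performance is not shown in this section, that it is the plain mean of the three literacy scores, that the three scores have equal variance, and that every pair of scores has the same correlation rho. Under those assumptions the correlation between one score and the mean of k scores is sqrt((1 + (k - 1) * rho) / k).

```python
import math

def corr_component_with_mean(rho, k=3):
    """Correlation between one of k equally variable scores and their mean,
    when every pair of scores has correlation rho (illustrative assumption)."""
    return math.sqrt((1 + (k - 1) * rho) / k)

# Pairwise correlations between the literacy scores reported above are 0.86 to 0.88.
for rho in (0.86, 0.88):
    print(rho, round(corr_component_with_mean(rho), 3))
```

With pairwise correlations of 0.86 to 0.88 this gives values of about 0.95 to 0.96, close to the figures printed above, so the very strong correlations would be expected if the assumption holds.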
Next, I would like to understand the relationship between categorical variables and numeric variables. The dependent variable is "Academic performance" and the other variables act as independent variables. In this bivariate exploration I use box/violin plots together with FacetGrid histograms. The violin/box plots give summary statistics for each category (such as the median, the upper and lower quartiles, and the minimum and maximum values), the bar plots give the mean, and the FacetGrid histograms show the frequency, i.e. the number of students, in each category.
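The same information that the box/violin plots and the faceted histograms convey visually can also be tabulated. This is a small sketch on made-up data; the helper names `q1`, `q3`, and `summarize_by_category` are hypothetical and not part of the notebook.

```python
import pandas as pd

def q1(s):
    """Lower quartile of a Series."""
    return s.quantile(0.25)

def q3(s):
    """Upper quartile of a Series."""
    return s.quantile(0.75)

def summarize_by_category(df, cat, num):
    """Count, lower quartile, median and upper quartile of `num` for each level of `cat`."""
    return df.groupby(cat)[num].agg(["count", q1, "median", q3])

# Toy data standing in for df_pisa_clean (made-up scores).
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Academic_performance": [400.0, 420.0, 440.0, 390.0, 410.0],
})
summary = summarize_by_category(toy, "Gender", "Academic_performance")
print(summary)
```

The `count` column corresponds to what the histograms show, and the quartile columns to the lines drawn inside the violins.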
- Before that, I would like to arrange the ordinal categorical variables in the correct order.
ordinal_var_dict = {'Class_repetition': ['Did not repeat a grade', 'Repeated a grade'],
'Immigration_status': ['Native', 'First-Generation', 'Second-Generation'],
'Perseverance_Give_up_easily': ['Not at all like me','Not much like me','Somewhat like me',
'Mostly like me', 'Very much like me'],
'Teacher_support': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Student_teacher_relation': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Math_interest': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'School_does_little_to_prepare_me_for_life': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Class_management_teacher_keep_class_orderly': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Class_management_teacher_starts_on_time': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Student_teacher_relation_teachers_are_interested': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Student_teacher_relation_teachers_listen_to_students': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Student_teacher_relation_teachers_help_students': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Teacher_support_help_when_needed': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Teacher_support_help_learn': ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree'],
'Teacher_did_not_explain_well': ['Very likely', 'Likely', 'Slightly likely', 'Not at all likely'],
'Teacher_did_not_get_students_interested': ['Very likely', 'Likely', 'Slightly likely', 'Not at all likely'],
'Highest_educational_level_parents': ['ISCED 1','ISCED 2','ISCED 3B, C','ISCED 3A, 4','ISCED 5B',
'ISCED 5A, 6']}
for var in ordinal_var_dict:
    pd_ver = pd.__version__.split(".")
    if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
        ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                    categories = ordinal_var_dict[var])
        df_pisa_clean[var] = df_pisa_clean[var].astype(ordered_var)
    else: # pre-v0.21
        df_pisa_clean[var] = df_pisa_clean[var].astype('category', ordered = True,
                                                     categories = ordinal_var_dict[var])
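To illustrate what the ordered categorical conversion buys, here is a small sketch on a made-up Series using the perseverance levels from the dictionary above. Once the dtype is ordered, sorting and comparisons follow the declared order of the levels rather than alphabetical order, and plotting libraries lay out the categories in that order too.

```python
import pandas as pd

levels = ["Not at all like me", "Not much like me", "Somewhat like me",
          "Mostly like me", "Very much like me"]
dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

# Made-up answers, deliberately out of order.
answers = pd.Series(["Very much like me", "Not at all like me", "Somewhat like me"]).astype(dtype)

# Sorting follows the declared level order, not the alphabet.
sorted_answers = answers.sort_values().tolist()
print(sorted_answers)

# Ordered categoricals also support comparisons against a level.
below_somewhat = (answers < "Somewhat like me").tolist()
print(below_somewhat)
```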
def bivar_catnum(var, cat, col_wrap, height, bins=None):
    # Faceted histograms of `var`, one panel per level of `cat`.
    # When bins is True, the global `bin_edges` (defined before the first call) sets the bin edges.
    g = sb.FacetGrid(data = df_pisa_clean, col = cat, col_wrap=col_wrap, height=height)
    if bins == True:
        g.map(plt.hist, var, bins=bin_edges, color='purple');
    else:
        g.map(plt.hist, var);
    for axis in g.axes.flat:
        axis.tick_params(labelleft=True)
    plt.setp(g.axes, xlabel=var, ylabel='Number of Students');
    g.fig.subplots_adjust(top=0.89);
    g.fig.suptitle(f"Student's Distribution by {var} and {cat}");
# Source: https://stackoverflow.com/questions/36573789/python-seaborn-facetgrid-change-xlabels
# https://stackoverflow.com/questions/72196032/how-to-customize-histogram-using-seaborn-facetgrid
# https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot
base_color =sb.color_palette()[9]
def bivar_numcat(var1, var2, kind):
    # Plot var2 against var1 as a violin, box, or bar plot; base_color is read at call time.
    plt.figure(figsize=[8,6])
    if kind == 'violin':
        sb.violinplot(data=df_pisa_clean, x=var1, y=var2, color=base_color, inner='quartile')
    elif kind == 'box':
        sb.boxplot(data=df_pisa_clean, x=var1, y=var2, color=base_color)
    elif kind == 'bar':
        sb.barplot(data=df_pisa_clean, x=var1, y=var2, color=base_color)
    plt.xlabel(var1, fontsize = 12)
    # The bar plot shows group means, so its y-axis is labelled as an average.
    plt.ylabel(f"Average {var2}" if kind == 'bar' else var2, fontsize = 12)
    plt.title(f"Student's Distribution by {var1} and {var2}")
df_pisa_clean.groupby('Country')['Academic_performance'].median()
Country
Albania                 401.239893
United Arab Emirates    459.147967
Name: Academic_performance, dtype: float64
bivar_numcat('Academic_performance', 'Country', "violin")
bin_edges = np.arange(df_pisa_clean['Academic_performance'].min(), df_pisa_clean['Academic_performance'].max()+10,20)
bivar_catnum('Academic_performance', 'Country', col_wrap=2, height=6, bins=True)
These plots analyze the relationship between two variables: academic performance and country. The violin plot reveals that students from the United Arab Emirates performed better academically than students from Albania: the median is about 459 for students from the United Arab Emirates and about 401 for students from Albania. In support of this observation, the histograms show that the tallest bar (the mode) is at about 410 for students in Albania and between 450 and 470 for students in the United Arab Emirates.
Meanwhile, the violin for Albania is wider than the violin for the United Arab Emirates. Because seaborn scales each violin to the same area by default, this width reflects how concentrated the scores are rather than how many students took part, so the counts are best read from the histograms. The histograms show that MORE than 480 students from Albania scored above 400, as against the fewer students from the United Arab Emirates who scored above 400.
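To back the reading of the histograms with numbers, the count of students above a score threshold can be tabulated per country. This is a sketch on made-up data; the helper name `count_above` is hypothetical, and on the real data it would be called with `df_pisa_clean`.

```python
import pandas as pd

def count_above(df, group, score, threshold):
    """Number of rows per `group` whose `score` is strictly above `threshold`."""
    return df.loc[df[score] > threshold].groupby(group).size()

# Toy data standing in for df_pisa_clean (made-up scores).
toy = pd.DataFrame({
    "Country": ["Albania"] * 4 + ["United Arab Emirates"] * 3,
    "Academic_performance": [380.0, 405.0, 410.0, 450.0, 395.0, 460.0, 470.0],
})
counts = count_above(toy, "Country", "Academic_performance", 400)
print(counts)
```

Dividing these counts by `df.groupby(group).size()` would turn them into shares, which are easier to compare when the two countries contribute different numbers of students.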
bivar_numcat('Gender','Academic_performance', "bar")
bivar_catnum('Academic_performance', 'Gender', col_wrap=2, height=6, bins=True)
- Past research suggests that girls are in general more successful in school than boys. Given this belief that gender is a factor that influences students' academic performance, I looked into the relationship between academic performance and gender.
- The bar plot shows no major difference in academic performance by gender, i.e. male and female students perform almost alike in their academic studies. However, the histograms show that female students perform slightly better than male students. This is consistent with reports that female students outperform their male counterparts in academic performance (Orabi, 2007; Dayioglu & Turut, 2007; Khwaileh & Zaza, 2010).
bivar_numcat('Class_repetition', 'Academic_performance','violin');
df_pisa_clean.groupby('Class_repetition')['Academic_performance'].mean()
Class_repetition
Did not repeat a grade    402.768040
Repeated a grade          399.179649
Name: Academic_performance, dtype: float64
bivar_catnum('Academic_performance', 'Class_repetition', col_wrap=2, height=6, bins=True)
bivar_numcat('Immigration_status', 'Academic_performance','bar');
bivar_catnum('Academic_performance', 'Immigration_status', col_wrap=3, height=5, bins=True)
bivar_numcat('Highest_educational_level_parents', 'Academic_performance','bar');
plt.xticks(rotation=15);
bivar_catnum('Academic_performance', 'Highest_educational_level_parents', col_wrap=3, height=4.5, bins=True)
bivar_numcat('Perseverance_Give_up_easily', 'Academic_performance','bar');
plt.xticks(rotation=15);
bivar_catnum('Academic_performance', 'Perseverance_Give_up_easily', col_wrap=3, height=4.5, bins=True)
bivar_catnum('Academic_performance', 'School_does_little_to_prepare_me_for_life', col_wrap=3, height=4.5, bins=True)
bivar_numcat('School_does_little_to_prepare_me_for_life', 'Academic_performance','bar');
plt.xticks(rotation=15);
bivar_numcat('Class_management_teacher_keep_class_orderly', 'Academic_performance','bar');
bivar_numcat('Class_management_teacher_starts_on_time', 'Academic_performance', 'bar');
bivar_numcat('Teacher_did_not_explain_well', 'Academic_performance','bar');
bivar_numcat('Teacher_did_not_get_students_interested','Academic_performance','bar');
df_pisa_clean.groupby('Teacher_support')['Academic_performance'].median()
Teacher_support
Strongly agree       402.470727
Agree                406.085850
Disagree             411.691027
Strongly disagree    446.834847
Name: Academic_performance, dtype: float64
bivar_numcat('Teacher_support', 'Academic_performance','violin');
bivar_catnum('Academic_performance', 'Student_teacher_relation', col_wrap=3, height=4.5, bins=True)
bivar_numcat('Student_teacher_relation','Academic_performance', 'bar');
After examining the relationship between academic performance and the other variables, I also looked into the relationship between students' average learning time and other features. The bivar_numcat helper function defined earlier is reused here to avoid code repetition.
base_color =sb.color_palette()[8];
bivar_numcat('Immigration_status', 'Average_learning_time', "violin");
plt.ylabel("Average learning time (mins/week)");
base_color =sb.color_palette()[8];
bivar_numcat('Class_repetition', 'Average_learning_time', "violin");
plt.ylabel("Average learning time (mins/week)");
base_color =sb.color_palette()[8];
bivar_numcat('Highest_educational_level_parents', 'Average_learning_time', "violin");
plt.ylabel("Average learning time (mins/week)");
plt.xticks(rotation=15);
base_color =sb.color_palette()[0];
bivar_numcat('Math_interest', 'Average_math_literacy', "box");
In this section, relationships among three or more variables in the PISA dataset are investigated. Each investigation follows on from the findings of the previous sections.
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
df_heatmap = df_pisa_clean.pivot_table(values='Academic_performance',index='Class_repetition',
columns='Immigration_status',aggfunc=np.mean)
sb.heatmap(df_heatmap,annot=True, fmt = '.2f',
cbar_kws = {'label' : 'mean(Academic_performance)'})
plt.xlabel('Immigration_status', fontsize =12)
plt.ylabel('Class_repetition', fontsize =12)  # heatmap rows are the Class_repetition levels
plt.title('Effect of Immigration status and Class_repetition on the Academic performance of students', fontsize =12);
plt.subplot(1,2,2)
ax = sb.barplot(data = df_pisa_clean, x = 'Immigration_status', y = 'Academic_performance', hue = 'Class_repetition')
ax.legend(loc = 8, ncol = 3, framealpha = 1, title = 'Class_repetition');
plt.xlabel('Immigration_status', fontsize =12)
plt.ylabel('Academic_performance', fontsize =12)
plt.title('Effect of Immigration status and Class_repetition on the Academic performance of students', fontsize =12);
def multi_catnum(num, cat1, cat2, kind, height, col_wrap):
    # Plot the mean (bar, heatmap) or the distribution (violin) of `num` across two categorical variables.
    # height and col_wrap are only used by the faceted violin plot.
    if kind == "bar":
        plt.figure(figsize=(13,11))
        ax = sb.barplot(data = df_pisa_clean, x = cat1, y = num, hue = cat2, palette='viridis_r')
        plt.setp(ax.axes, xlabel= cat1, ylabel= num);
        plt.title(f"Effect of Student's {cat1} and {cat2} on Academic Performance");
    elif kind == "heatmap":
        plt.figure(figsize=(18,5))
        df_heatmap = df_pisa_clean.pivot_table(values=num,index=cat1,
                                               columns=cat2,aggfunc=np.mean)
        sb.heatmap(df_heatmap,annot=True, fmt = '.2f', cbar_kws = {'label' : f'mean({num})'})
        plt.title(f"Effect of Student's {cat1} and {cat2} on Academic Performance");
    elif kind == "violin":
        g=sb.FacetGrid(data= df_pisa_clean, col = cat1, margin_titles=True, col_wrap=col_wrap, height =height)
        # Pass the level order explicitly so every facet uses the same order;
        # this assumes cat2 was converted to an ordered categorical above.
        g.map(sb.violinplot, num, cat2, inner='quartile',
              order=list(df_pisa_clean[cat2].cat.categories))
        g.fig.subplots_adjust(top=0.89);
        g.fig.suptitle(f"Effect of Student's {cat1} and {cat2} on Academic Performance");
    plt.show()
multi_catnum("Academic_performance", "Immigration_status","Perseverance_Give_up_easily", kind='bar', height=None, col_wrap=None)
multi_catnum("Academic_performance", "Class_repetition","Highest_educational_level_parents", kind='bar',
height=None, col_wrap=None);
multi_catnum("Academic_performance", "Class_repetition","Perseverance_Give_up_easily", kind='bar',
height=None, col_wrap=None);
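The plot of the effect of class repetition and students' tendency to give up easily on academic performance reveals that, whether or not students repeated a grade, those who are NOT MUCH likely to give up easily performed best academically, while those who are VERY MUCH likely to give up easily performed lowest.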
multi_catnum("Academic_performance", "Highest_educational_level_parents","Perseverance_Give_up_easily",
kind='heatmap', height=None, col_wrap=None)
multi_catnum("Academic_performance", "Highest_educational_level_parents","Perseverance_Give_up_easily",
kind='violin', height=5, col_wrap=3)
# A helper function was also created to explore relationships among other variables.
def multi_numcat(cat, num1, num2, kind, col_wrap, height):
    if kind == 'color_hue':
        g=sb.FacetGrid(data= df_pisa_clean, hue = cat, height = 7, aspect=1.5)
        g.map(sb.regplot,num1, num2, x_jitter =0.3, scatter_kws = {'alpha' : 0.8}, fit_reg=False)
        g.add_legend()
        g.fig.suptitle(f"Effect of Student's {cat} and {num1} on {num2}")
    elif kind== 'color_bar':
        # Here `cat` names a third numeric column that is mapped to the colour bar
        # (the original referred to an undefined name, num3).
        plt.scatter(data= df_pisa_clean, x=num1, y=num2, c=cat, alpha = 1)
        plt.colorbar(label=cat)
        plt.xlabel(num1)
        plt.ylabel(num2)
        plt.title(f"Effect of Student's {cat} and {num1} on {num2}")
    elif kind == 'face':
        g=sb.FacetGrid(data= df_pisa_clean, col = cat, margin_titles=True, col_wrap=col_wrap, height =height)
        g.map(plt.scatter,num1, num2, alpha = 1)
        g.fig.subplots_adjust(top=0.89);
        g.fig.suptitle(f"Effect of Student's {cat} and {num1} on {num2}");
    plt.show()
multi_numcat("Perseverance_Give_up_easily", "Average_learning_time", "Academic_performance",
kind = 'face', col_wrap= 3, height = 4.5)
multi_numcat("Class_repetition", "Average_learning_time", "Academic_performance", kind = 'face', col_wrap= 2, height = 4.5)
multi_numcat("Class_repetition", "Average_learning_time", "Academic_performance",
kind = 'color_hue', col_wrap= None, height = 4.5)
multi_numcat("Immigration_status", "Average_learning_time", "Academic_performance",
kind = 'color_hue', col_wrap= None, height = 4.5)
multi_numcat("Highest_educational_level_parents", "Average_learning_time", "Academic_performance",
kind = 'face', col_wrap= 3, height = 4)
plt.figure(figsize = [15, 10])
plt.scatter(data= df_pisa_clean, x="Average_learning_time", y="Academic_performance", c="Highest_parental_education_years",
alpha = 1)
plt.colorbar(label='Highest_parental_education_years')
plt.xlabel('Average_learning_time')
plt.ylabel('Academic_performance')
plt.xlim(70, 330);
plt.ylim(180,600);
multi_numcat("Grade", "Average_learning_time", "Academic_performance",
kind = 'face', col_wrap= 3, height = 4)
The next exploration examines the relationship among four variables. I created a helper function, shown below, for this exploration too.
def quadri_var(num1, num2, cat1, cat2, title):
    # The top margin needed for the figure title differs from plot to plot,
    # so `title` selects it: True uses 0.9, anything else uses 0.85.
    g = sb.FacetGrid(df_pisa_clean, col = cat1, hue=cat2, col_wrap=3, height= 4)
    g.map(sb.scatterplot, num1, num2, alpha=1);
    g.add_legend();
    plt.setp(g.axes, xlabel='Average_learning_time (mins/week)', ylabel='Academic_performance');
    g.figure.suptitle(f"Relationship among student's {num1}, {num2}, {cat1} and {cat2}")
    g.figure.subplots_adjust(top=.9 if title == True else .85)
quadri_var('Average_learning_time', 'Academic_performance', 'Perseverance_Give_up_easily', 'Class_repetition', title=True)
g = sb.FacetGrid(df_pisa_clean, col = 'Perseverance_Give_up_easily', hue="Class_repetition", col_wrap=3, height= 4)
g.map(sb.scatterplot, 'Average_learning_time', 'Academic_performance', alpha=1);
g.add_legend();
quadri_var('Average_learning_time', 'Academic_performance', 'Immigration_status', 'Class_repetition', title=None)
quadri_var('Average_learning_time', 'Academic_performance', 'Perseverance_Give_up_easily', 'Immigration_status', title=True)
From the various analyses carried out and the plots visualized, I can say that more than 50% of the students have learnt the school curriculum well enough.
The factors affecting the academic performance of students in the Program for International Student Assessment (PISA) dataset were investigated. The investigation is summarized as follows:
- In the PISA dataset exploration, I discovered that more of the students who took part in the assessment came from Albania than from the United Arab Emirates; however, students from the United Arab Emirates performed better academically than students from Albania. The students are mostly 15 years old, and more of them are female than male. The assumption that female students outperform male students across a range of indicators of academic performance was validated.
- The two engineered variables, average learning time and academic performance, are positively but very weakly correlated. This suggests that students who spend more time learning tend to perform slightly better in their academic studies.
- More native students took part in the assessment than first- or second-generation immigrant students, but immigrant students performed better academically than native students; hence immigration status has an impact on the academic performance of students in the PISA dataset analyzed. Of note, both native and immigrant students are more likely NOT to have repeated a grade than to have repeated one. Among immigrant students, those who did not repeat a grade performed better academically than those who repeated a grade, whether they are first- or second-generation immigrants. Immigrant students who repeated a grade also performed better academically than native students, whether or not the native students repeated a grade.
- It is also worth mentioning that students whose parents have higher levels of education, or spent more years in education, performed better academically than students whose parents have lower levels of education; the former may have an enhanced regard for learning, more positive ability beliefs, and a stronger work orientation, and they may use more effective learning strategies.
- The role of school in preparing students for life was validated, as students who performed best academically are prepared for life and better able to make the transition into life beyond school and adulthood and to achieve occupational and economic success. In addition, the students who performed best academically stated that they would likely attribute their failure to teachers not explaining well in class and not getting them interested in the subjects taught. This can be summed up as a problem of negative teacher attitude.
- Meanwhile, it should be emphasized that students who are not likely to give up easily in their academic studies performed better academically than those who tend to give up easily. In addition, students who do not give up easily spent more time learning, and thus performed better academically, than those who tend to give up easily. Also, whether students REPEATED a grade or NOT, students who are NOT MUCH likely to give up easily performed the best academically, while those who are VERY MUCH LIKELY to give up easily are low-performing students. Across all students, those who REPEATED a grade AND are NOT MUCH LIKELY to give up easily scored higher than other students, irrespective of whether the other students repeated a grade or not.
- Finally, in this PISA dataset, it is unlikely that student-teacher relationship and classroom management promote students' academic achievement.
- Besides the main variable of interest, I also looked into the relationships between average learning time and other variables. I found that immigrant students spent more time learning than native students. Also, as expected, students who repeated a grade spent more time learning (to make up for lost ground) than those who did not repeat a grade. It is also worth noting that students whose parents have more years or a higher level of education spent longer learning than students whose parents have fewer years or a lower level of education.
- In conclusion, I also discovered that irrespective of a student's immigration status, grade repetition status, parents' educational level, or tendency to give up easily, students who spent more time learning performed better academically than those who spent less time learning.
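The last point, that more learning time goes with better performance irrespective of group membership, can be checked numerically by computing the correlation separately inside each group. The following is a minimal sketch on made-up data; the helper name `corr_within_groups` is hypothetical, and the toy values are chosen only so that the two groups behave differently.

```python
import pandas as pd

def corr_within_groups(df, group, x, y):
    """Pearson correlation between columns x and y, computed separately for each level of `group`."""
    return pd.Series({name: g[x].corr(g[y]) for name, g in df.groupby(group)})

# Toy data standing in for df_pisa_clean (made-up values).
toy = pd.DataFrame({
    "Class_repetition": ["A", "A", "A", "B", "B", "B"],
    "Average_learning_time": [100.0, 150.0, 200.0, 100.0, 150.0, 200.0],
    "Academic_performance": [380.0, 400.0, 420.0, 430.0, 410.0, 390.0],
})
by_group = corr_within_groups(toy, "Class_repetition", "Average_learning_time", "Academic_performance")
print(by_group)
```

If the statement holds on the real data, every group's correlation should come out positive; a negative or near-zero value for some group would mean the overall trend does not carry over to that group.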
- https://www.childtrends.org/indicators/immigrant-children
- https://ec.europa.eu/eurostat/documents/1978984/6037342/ISCED-EN.pdf
- https://books.google.com.ng/books?id=Yw9GAgAAQBAJ&pg=PA261&lpg=PA261&dq=Grade+compared+to+modal+grade+in+country&source=bl&ots=p1qgcM6iNJ&sig=ACfU3U2wEWEaWfFrDf-DDCib6YMPTVDlyw&hl=en&sa=X&ved=2ahUKEwifh6Tinb35AhVGVfEDHRs-AcwQ6AF6BAgYEAM#v=onepage&q=Grade%20compared%20to%20modal%20grade%20in%20country&f=false
- https://medium.com/swlh/identify-outliers-with-pandas-statsmodels-and-seaborn-2766103bf67c
- https://www.quora.com/How-can-I-remove-outliers-in-a-large-dataset-with-pandas
- https://stackoverflow.com/questions/44620465/why-did-reset-indexdrop-true-function-unwantedly-remove-column
- https://www.geeksforgeeks.org/how-to-adjust-title-position-in-matplotlib/
- https://mldoodles.com/matplotlib-pie-chart/
- https://stackoverflow.com/questions/36573789/python-seaborn-facetgrid-change-xlabels
- https://stackoverflow.com/questions/72196032/how-to-customize-histogram-using-seaborn-facetgrid
- https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot
- https://seaborn.pydata.org/tutorial/color_palettes.html
- https://www.geeksforgeeks.org/replacing-missing-values-using-pandas-in-python/
- https://medium.com/@uknak/school-do-not-prepare-us-for-the-real-world-47adfb25bd9f
- https://www.theschooloflife.com/article/success-at-school-vs-success-in-life/
- https://files.eric.ed.gov/fulltext/EJ1266806.pdf
- https://info.retiredteachers.org/blog/how-does-classroom-management-promote-student-learning
- https://files.eric.ed.gov/fulltext/EJ1232893.pdf
- https://www.researchgate.net/publication/339720791_Academic_achievement_among_university_students_The_role_of_causal_attribution_of_academic_success_and_failure.
- https://samphina.com.ng/impact-teacher-student-relationship-academic-performance/
- https://www.nepjol.info/index.php/jdse/article/download/27958/23066/82820
- https://ec.europa.eu/eurostat/statistics-explained/index.php?title=First_and_second-generation_immigrants_-_statistics_on_education_and_skills.
- Google search engine
- ALX-Udacity Data Analysis Slack forum
fig, ax = plt.subplots(5,2, figsize=(18, 30))
ax1 = df_pisa_clean.groupby(['Highest_educational_level_parents'])['Immigration_status'].value_counts().unstack().plot(kind='barh',
stacked = True, ax=ax[0][0])
ax[0][0].invert_yaxis()  # invert the axes just plotted, not whichever axes is current
ax[0][0].title.set_text("Plot 1B: Highest Educational Level of Parents and Immigration status")
sb.countplot(data=df_pisa_clean, x='Immigration_status', hue='Country', ax=ax[0][1])
ax[0][1].title.set_text("Plot 2B: Relationship between student's Immigration status and Country")
ax2 = df_pisa_clean.groupby(['Highest_educational_level_parents'])['Country'].value_counts().unstack().plot(kind='barh',
stacked = True,ax=ax[1][0])
ax[1][0].title.set_text("Plot 3B: Relationship between Highest Educational Level of Parents and Country")
sb.countplot(data=df_pisa_clean, x='Immigration_status', hue='Gender', ax=ax[1][1])
ax[1][1].title.set_text("Plot 4B: Relationship between student's Immigration status and Gender")
ax3 = df_pisa_clean.groupby(['Perseverance_Give_up_easily'])['Immigration_status'].value_counts().unstack().plot(kind='barh',
stacked = True, ax=ax[2][0])
ax[2][0].invert_yaxis()  # invert the axes just plotted, not whichever axes is current
ax[2][0].title.set_text("Plot 5B: Perseverance(Give up easily) and Immigration status")
sb.countplot(data=df_pisa_clean, x='Class_repetition', hue= 'Immigration_status', ax=ax[2][1])
ax[2][1].title.set_text("Plot 6B: Relationship between student's Immigration status and Class repetition")
ax4 = df_pisa_clean.groupby(['Perseverance_Give_up_easily'])['Class_repetition'].value_counts().unstack().plot(kind='barh',
stacked = True, ax=ax[3][0])
ax[3][0].invert_yaxis()  # invert the axes just plotted, not whichever axes is current
ax[3][0].title.set_text("Plot 7B: Perseverance(Give up easily) and Class repetition")
sb.countplot(data=df_pisa_clean, x='Highest_educational_level_parents', hue='Class_repetition', ax=ax[3][1])
ax[3][1].title.set_text("Plot 8B: Highest Educational Level of Parents and Class repetition");
ax[3][1].tick_params(axis='x', labelrotation=15);  # rotate the labels of Plot 8B specifically
ax5 = df_pisa_clean.groupby(['Perseverance_Give_up_easily'])['Country'].value_counts().unstack().plot(kind='barh',
stacked = True, ax=ax[4][0])
ax[4][0].invert_yaxis()  # invert the axes just plotted, not whichever axes is current
ax[4][0].title.set_text("Plot 9B: Relationship between student's Perseverance(Give up easily) and Country")
ax6 = df_pisa_clean.groupby(['Perseverance_Give_up_easily'])['Highest_educational_level_parents'].value_counts().unstack().plot(kind='barh',
stacked = True, ax=ax[4][1])
ax[4][1].title.set_text("Plot 10B: Perseverance(Give up easily) and Highest Educational Level of Parents")
Plot 1B: This plot shows the relationship between the highest educational level of parents (HELP) and immigration status. Among native students, those whose parents' highest educational level is ISCED 3A, 4 outnumber those whose parents are at any other educational level. Among immigrant students, those whose parents' highest educational level is ISCED 5A, 6 outnumber those whose parents are at any other educational level. The trend follows the order:
- Native students = ISCED 3A, 4 > ISCED 5A, 6 > ISCED 2 > ISCED 5B > ISCED 3B, C > ISCED 1.
- First-generation immigrant students = ISCED 5A, 6 > ISCED 3A, 4 > ISCED 5B.
- Second-generation immigrant students = ISCED 5A, 6 > ISCED 5B > ISCED 3A, 4.
Plot 2B: This plot looks at the relationship between country and immigration status. In Albania, MORE native students took part in the survey than first- and second-generation immigrant students. In the United Arab Emirates the reverse is the case: the students who took part in the PISA assessment are MOSTLY first-generation immigrants, followed by native students and finally second-generation immigrants.
- Albania = Natives > First-generation immigrants > Second-generation immigrants.
- United Arab Emirates = First-generation immigrants > Natives > Second-generation immigrants.
Plot 3B: This plot depicts the relationship between country and HELP. In Albania, students whose parents' highest educational level is ISCED 3A, 4 outnumber students whose parents are at any other educational level. In the United Arab Emirates, however, students whose parents' highest educational level is ISCED 5A, 6 are the largest group.
Plot 4B: This plot answers the question of how the immigrant students in the PISA dataset are distributed by gender. It shows the relationship between immigration status and gender. Native students and second-generation immigrant students who took part in the assessment are mostly female, while among first-generation immigrant students MORE males than females took part.
Plot 5B: Is the immigration status of students a strong determinant of their tendency to give up easily? This plot addresses the question by analyzing the relationship between students who give up easily and their immigration status. Among native students, those who are NOT likely to give up easily outnumber those who are likely to give up easily. Among immigrant students, whether first- or second-generation, those who are NOT MUCH likely to give up easily outnumber those who are either very much likely or not at all likely to give up easily. It is worth noting that the native students who are very much likely to give up easily are MORE than the immigrant students who are NOT MUCH likely to give up easily, whether first- or second-generation.
- Native students = Not at all likely > Not much likely > Somewhat likely ~ Mostly likely > Very much likely to give up easily.
- First-generation immigrant students = Not much likely > Somewhat likely > Not at all likely > Very much likely > Mostly likely to give up easily.
- Second-generation immigrant students = Not much likely > Not at all likely to give up easily.
Plot 6B: This plot addresses the question of how a student's immigration status relates to whether the student repeats a grade. It specifically shows the relationship between Class repetition and Immigration status. Among all who took part in the assessment, native students form the largest group both among students who did not repeat a grade and among those who did.
- Students who did not repeat a grade = Natives > First-generation immigrants > Second-generation immigrants.
- Students who repeated a grade = Natives > First-generation immigrants == Second-generation immigrants.
Plot 7B: In a similar manner, this plot addresses the relationship between class repetition and students' tendency to give up easily. As expected, students who did not repeat a grade are mostly NOT likely to give up easily, while those who repeated a grade are mostly VERY MUCH likely to give up easily.
- Students who did not repeat a grade = Not at all likely > Not much likely > Somewhat likely > Mostly likely > Very much likely to give up easily.
- Students who repeated a grade = Very much likely > Mostly likely ~ Somewhat likely > Not much likely > Not at all likely to give up easily.
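Raw counts can be hard to compare when the two groups differ in size, so one option is to convert each group's counts into within-group shares before ranking the categories. The sketch below assumes a table of counts shaped like `{group: {category: count}}`. The numbers are invented for illustration only, and `row_shares` is a hypothetical helper, not part of the original analysis.

```python
def row_shares(table):
    """Convert each row of counts into proportions that sum to 1."""
    shares = {}
    for group, counts in table.items():
        total = sum(counts.values())
        # Guard against an empty group to avoid dividing by zero.
        shares[group] = (
            {cat: n / total for cat, n in counts.items()} if total else {}
        )
    return shares


# Invented counts for illustration only; these are NOT the PISA figures.
toy_counts = {
    "Did not repeat": {
        "Not at all likely": 40,
        "Not much likely": 30,
        "Very much likely": 10,
    },
    "Repeated": {
        "Not at all likely": 2,
        "Not much likely": 3,
        "Very much likely": 5,
    },
}

shares = row_shares(toy_counts)
for group, cats in shares.items():
    top = max(cats, key=cats.get)  # category with the largest within-group share
    print(f"{group}: most common = {top} ({cats[top]:.0%})")
```

Shares make the two groups directly comparable even though, in this toy table, one group is eight times larger than the other.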
Plot 8B: The question of how parents' educational level relates to whether a student repeats a grade is addressed here. The plot examines the relationship between HELP and Class repetition. The analysis reveals that students who did not repeat a grade outnumber those who repeated, irrespective of their parents' highest educational level. The number of students who did not repeat is HIGHEST among students whose parents are at the ISCED 3A, 4 educational level.
Plot 9B: The relationship between Country and students' tendency to give up easily is illustrated in this plot. In Albania, students who are NOT AT ALL likely to give up easily outnumber those who are VERY MUCH likely to give up easily in their studies. In the United Arab Emirates, students who are NOT MUCH likely to give up easily outnumber every other category.
- Albania = Not at all likely > Not much likely > Somewhat likely > Mostly likely > Very much likely to give up easily.
- United Arab Emirates = Not much likely > Somewhat likely > Not at all likely > Very much likely > Mostly likely to give up easily.
Plot 10B: It is true that you cannot give what you do not have. The relationship between students' tendency to give up easily and their parents' highest educational level illustrates this. Plot 10B shows that students whose parents have a higher level of education (ISCED 3 and above) are LESS likely to give up easily than students whose parents have barely any formal education (ISCED 1).